![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Rule-Based NER and Assertion Models
This notebook demonstrates the use of rule-based Named Entity Recognition (NER) combined with assertion detection models for structured information extraction from text.
The notebook applies a series of Text Matcher models, each designed to identify and extract one entity type of interest (countries, states, cities, drugs, biomarkers, and cancer diagnoses) using predefined dictionaries and linguistic patterns. Because each matcher targets a narrow vocabulary, this approach favors precision, especially in specialized contexts where general-purpose models may underperform.
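
To make the dictionary-lookup idea concrete, here is a minimal pure-Python sketch. It is not how `TextMatcherModel` is implemented; the `COUNTRY_DICTIONARY` contents and the `match_entities` helper are hypothetical names used only for illustration.

```python
import re

# Hypothetical mini-dictionary for illustration; not the pretrained matcher's contents.
COUNTRY_DICTIONARY = {"india": "Country", "south korea": "Country", "morocco": "Country"}

def match_entities(text, dictionary):
    """Return (matched_text, begin, end, label) tuples; `end` is inclusive."""
    # List longer phrases first so multi-word entries are tried before shorter ones.
    phrases = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), m.start(), m.end() - 1, dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

text = "The patient has traveled to India, South Korea, and Morocco."
matches = match_entities(text, COUNTRY_DICTIONARY)
for matched_text, begin, end, label in matches:
    print(matched_text, begin, end, label)
```

A lookup like this only finds what the dictionary contains, which is why the approach tends to trade recall for precision.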

Beyond entity detection, the notebook also integrates Contextual Assertion Models, which determine the status of an entity in context. For example, they indicate whether a drug is mentioned as possibly used (the `contextual_assertion_possible` model) or conditionally prescribed (the `contextual_assertion_conditional` model).
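
As a rough illustration of context-based assertion, the sketch below labels an entity when a trigger word appears shortly before it. This is an assumed, simplified rule, not the pretrained models' logic; `ABSENT_TRIGGERS`, `detect_status`, and the `window` parameter are hypothetical.

```python
# Hypothetical trigger words for an "absent" status; real models use their own rules.
ABSENT_TRIGGERS = {"no", "denies", "without"}

def detect_status(tokens, entity_index, triggers, label, window=3):
    """Return `label` if a trigger occurs within `window` tokens before the entity, else None."""
    start = max(0, entity_index - window)
    preceding = [token.lower() for token in tokens[start:entity_index]]
    return label if any(token in triggers for token in preceding) else None

denied = "Patient denies nausea at this time".split()
given = "Patient given azithromycin without any difficulty".split()

status_nausea = detect_status(denied, denied.index("nausea"), ABSENT_TRIGGERS, "Absent")
status_drug = detect_status(given, given.index("azithromycin"), ABSENT_TRIGGERS, "Absent")
print(status_nausea, status_drug)
```

In the second sentence the trigger "without" follows the drug name, so this prefix-only rule leaves the drug unlabeled.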


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library
!pip install -q johnsnowlabs

In [None]:
# Upload license key for healthcare NLP
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# Turn on version enforcement, then force a fresh install of the licensed libraries
nlp.settings.enforce_versions = True
nlp.install(refresh_install=True)

In [None]:
# Import the Spark NLP and Healthcare NLP modules from the johnsnowlabs library
from johnsnowlabs import nlp, medical

nlp.install()

In [None]:
# Import required modules
from sparknlp.base import *
from pyspark.ml import Pipeline

# Start the Spark session used by the rest of the notebook
spark = nlp.start()

# 🔎 MODELS
The table lists the models used in this pipeline and the entity or assertion label each one produces.

| Index | Model | Entity / Assertion Label |
|---:|:------------------------|:-|
|  1 | [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) | Country |
|  2 | [state_matcher](https://nlp.johnsnowlabs.com/2024/09/11/state_matcher_en.html) | State |
|  3 | [city_matcher](https://nlp.johnsnowlabs.com/2024/07/02/city_matcher_en.html) | City |
|  4 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | Drug |
|  5 | [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html) | Biomarker |
|  6 | [cancer_diagnosis_matcher](https://nlp.johnsnowlabs.com/2024/06/17/cancer_diagnosis_matcher_en.html) | Cancer_dx |
|  7 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | Conditional |
|  8 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | Possible |
|  9 | [contextual_assertion_someone_else](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_someone_else_en.html) | Someone_else |
| 10 | [contextual_assertion_absent](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_absent_en.html) | Absent |
| 11 | [contextual_assertion_past](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_past_en.html) | Past |

# Rule-based Pipeline with Separated Entity Processing
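This pipeline combines the rule-based matchers with the contextual assertion models. All matcher outputs are merged into `ner_chunk`, while only the clinical entities (drug, biomarker, and cancer diagnosis) are merged into `clinical_entities` and passed to the assertion models.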
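
When several matchers run over the same text, their spans can overlap, and a merger has to decide which chunk survives. The sketch below shows one simple policy, keeping the longer span, in plain Python. It is an assumption for illustration only: `merge_chunks` is a hypothetical helper, and the merge strategies configured in the pipeline below may resolve overlaps differently.

```python
def merge_chunks(chunks):
    """Resolve overlaps between (begin, end, label, text) chunks by keeping the longer span."""
    # Consider the longest spans first; ties go to the earliest start.
    ordered = sorted(chunks, key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for chunk in ordered:
        # Two inclusive spans overlap when each starts before the other ends.
        overlaps = any(chunk[0] <= k[1] and k[0] <= chunk[1] for k in kept)
        if not overlaps:
            kept.append(chunk)
    return sorted(kept, key=lambda c: c[0])

# Hypothetical outputs of two matchers over "Mexico City and Cairo".
candidates = [
    (0, 5, "Country", "Mexico"),
    (0, 10, "City", "Mexico City"),
    (16, 20, "City", "Cairo"),
]
merged = merge_chunks(candidates)
print(merged)
```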

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

country_matcher = medical.TextMatcherModel.pretrained("country_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("country")\
    .setMergeOverlapping(True)

state_matcher = medical.TextMatcherModel.pretrained("state_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("state")\
    .setMergeOverlapping(True)

city_matcher = medical.TextMatcherModel.pretrained("city_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("city")\
    .setMergeOverlapping(True)

drug_matcher = medical.TextMatcherModel.pretrained("drug_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("drug")

biomarker_matcher = medical.TextMatcherModel.pretrained("biomarker_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("biomarker")

cancer_diagnosis_matcher = medical.TextMatcherModel.pretrained("cancer_diagnosis_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("cancer_dx")\
    .setMergeOverlapping(True)

# Merge all NER entities
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols(["drug", "biomarker", "cancer_dx","country", "state", "city"])\
    .setOutputCol("ner_chunk")\
    .setSelectionStrategy("Sequential")

# Merge clinical entities (for assertions)
clinical_merger = medical.ChunkMergeApproach()\
    .setInputCols(["drug", "biomarker", "cancer_dx"])\
    .setOutputCol("clinical_entities")\
    .setSelectionStrategy("DiverseLonger")\
    .setOrderingFeatures(["ChunkLength"])

# Assertion models (only for clinical entities)
contextual_assertion_conditional = medical.ContextualAssertion.pretrained("contextual_assertion_conditional", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_conditional")

contextual_assertion_possible = medical.ContextualAssertion.pretrained("contextual_assertion_possible", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_possible")

contextual_assertion_someone_else = medical.ContextualAssertion.pretrained("contextual_assertion_someone_else", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_someone_else")

contextual_assertion_absent = medical.ContextualAssertion.pretrained("contextual_assertion_absent", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_absent")

contextual_assertion_past = medical.ContextualAssertion.pretrained("contextual_assertion_past", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_past")

assertion_merger = medical.AssertionMerger()\
    .setInputCols(["assertion_conditional", "assertion_possible", "assertion_someone_else", "assertion_absent", "assertion_past"])\
    .setOutputCol("clinical_assertions")\
    .setMergeOverlapping(True)\
    .setSelectionStrategy("sequential")\
    .setAssertionSourcePrecedence("assertion_conditional, assertion_possible, assertion_someone_else, assertion_absent, assertion_past")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    country_matcher,
    state_matcher,
    city_matcher,
    drug_matcher,
    biomarker_matcher,
    cancer_diagnosis_matcher,
    chunk_merger,
    clinical_merger,
    contextual_assertion_conditional,
    contextual_assertion_possible,
    contextual_assertion_someone_else,
    contextual_assertion_absent,
    contextual_assertion_past,
    assertion_merger
])


sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
country_matcher download started this may take some time.
Approximate size to download 10.2 KB
[OK!]
state_matcher download started this may take some time.
Approximate size to download 6.1 KB
[OK!]
city_matcher download started this may take some time.
Approximate size to download 180.3 KB
[OK!]
drug_matcher download started this may take some time.
Approximate size to download 5.5 MB
[OK!]
biomarker_matcher download started this may take some time.
Approximate size to download 25.6 KB
[OK!]
cancer_diagnosis_matcher download started this may take some time.
Approximate size to download 42.8 KB
[OK!]
contextual_assertion_conditional download started this may take some time.
Approximate size to download 1.3 KB
[OK!]
contextual_assertion_possible download started this may take some time.
Approximate size to download 1.7 KB
[OK!]
contextual_assertion_someone_else download 

# Fit the Pipeline

In [None]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
fitted_pipeline = pipeline.fit(empty_data)

# Sample Texts to Try Out the Pipeline

In [None]:
sample_texts = [
    """Name: Patel, Rina  Record Date: 2095-11-03  MR: 781093
Dr. Sofia Chen, IP: 172.16.254.12
She is a 48-year-old female admitted to Unity Health Institute in Toronto
for thyroidectomy on 11/03/95.
Patient's VIN: JH4KA8270MC012345, SSN: 333-22-7777, Driver’s License: P987654F
Phone: +1 (647) 555-1122, Address: 789 Queen Street, Toronto, Canada,
Email: rina.patel@caremail.org
In the past 18 months, the patient has traveled to India, Germany, Brazil,
South Korea, Morocco, and Australia for both business and leisure.
She reported brief stays in Mexico City and Cairo as well.
All travel occurred prior to surgery, and she denied any symptoms during or after her trips.""",

    """Patient Summary Report
Name: Green, Thomas L.  DOB: 08/14/2040  Sex: Male  MRN: 559882
Date of Encounter: 2094-08-30  Facility: St. Margaret’s Medical Center, Atlanta, Georgia
Physician: Dr. Rebecca Allen, MD – Internal Medicine
Chief Complaint: Persistent abdominal pain and fatigue for 2 weeks.
History of Present Illness: Mr. Green is a 54-year-old male who presented
to the emergency department in Atlanta, GA, with abdominal discomfort
described as a dull ache localized to the left lower quadrant.
He reports the pain began during a work trip to Texas and progressively
worsened while traveling through Nevada and Illinois.
The patient states he had similar episodes in the past during visits to Florida,
but those resolved spontaneously. He recently returned from a family reunion
in New York, where he experienced nausea and loss of appetite.""",

    """Name: Laura Martinez  Record Date: 2094-06-15  MR: 927384
Dr. Anthony Kim, IP: 10.0.0.45
She is a 62-year-old female admitted to Metropolitan Medical Center
in San Francisco for a knee replacement on 06/15/94.
Patient's VIN: 1N4AL11D75C678901, SSN: 555-66-9999, Driver's license no: D456321K
Phone: (415) 555-6723, 1122 Pine Avenue, Chicago, IL, USA,
E-mail: laura.martinez@healthmail.org
Patient has traveled to Rome, Dubai, and Cape Town in the past 12 months.""",

    """Maria’s physician prescribed clopidogrel for her cardiovascular risk,
along with ibuprofen for muscle pain, azithromycin for her sinus infection,
and omeprazole to manage her acid reflux on 2024-07-18.""",

    """In the bone marrow (BM) aspirate, blasts comprised 91.3% of nucleated cells,
expressing CD10, CD19, CD34, CD45, CD117, CD123, HLA-DR, and TdT by flow cytometric analysis.
Serum tumor marker evaluation revealed elevated levels of carcinoembryonic antigen (CEA: 6.42 ng/mL),
alpha-fetoprotein (AFP: 11.75 ng/mL), and pro-gastrin-releasing peptide (ProGRP: 85.3 pg/mL).""",

    """A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy,
total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma
(mucinous-type carcinoma, stage Ic) 1 year ago. The patient's medical compliance was poor and failed
to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2).
Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast
in 2 months. Core needle biopsy revealed metaplastic carcinoma.
Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2),
and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response,
followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting.
Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions.
The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation
associated with adenomyoepithelioma.
Immunohistochemistry study showed that the tumor cells are positive for epithelial markers
(cytokeratin AE1/AE3), and myoepithelial markers, including CK 5/6, p63, and S100.
Expressions of hormone receptors, including ER, PR, and Her-2/Neu, were all negative.""",

    """Patient has a family history of diabetes. Father diagnosed with heart failure last year.
Sister and brother both have asthma. Grandfather had cancer in his late 70s.
No known family history of substance abuse. Family history of autoimmune diseases is also noted.""",

    """Patient resting in bed. Patient given azithromycin without any difficulty.
Patient has audible wheezing, states chest tightness.
No evidence of hypertension. Patient denies nausea at this time. Zofran declined.
Patient is also having intermittent sweating associated with pneumonia.""",

    """The patient presents with symptoms suggestive of pneumonia, including fever, productive cough,
and mild dyspnea. Chest X-ray findings are compatible with a possible early-stage infection,
though bacterial pneumonia cannot be entirely excluded.""",

    """The patient reports intermittent chest pain when engaging in physical activity,
particularly on exertion. Symptoms appear to be contingent upon increased stress levels and heavy meals.""",

    """History of Present Illness: The patient reports a history of influenza with high fever
(up to 41 °C) approximately two months ago. He now presents again with flu-like symptoms,
including fever, but denies productive cough.
Family History: Father with a history of lung cancer."""
]


# Apply the Pipeline to Sample Texts

In [None]:
data = spark.createDataFrame([[text] for text in sample_texts]).toDF("text")
result = fitted_pipeline.transform(data)

# Print the Results for NER

In [None]:
# Print results for all NER entities
print("NER entities")
result.selectExpr("explode(ner_chunk)").select("col.result", "col.begin", "col.end", "col.metadata.entity").show(100, truncate=False)

NER entities
+-----------------------------+-----+----+---------+
|result                       |begin|end |entity   |
+-----------------------------+-----+----+---------+
|Toronto                      |155  |161 |City     |
|Toronto                      |326  |332 |City     |
|Canada                       |335  |340 |COUNTRY  |
|India                        |425  |429 |COUNTRY  |
|Germany                      |432  |438 |COUNTRY  |
|Brazil                       |441  |446 |COUNTRY  |
|South Korea                  |449  |459 |COUNTRY  |
|Morocco                      |462  |468 |COUNTRY  |
|Australia                    |475  |483 |COUNTRY  |
|Mexico                       |544  |549 |COUNTRY  |
|Cairo                        |560  |564 |City     |
|Thomas                       |36   |41  |DRUG     |
|Atlanta                      |159  |165 |City     |
|Georgia                      |168  |174 |COUNTRY  |
|Atlanta                      |402  |408 |City     |
|Texas                        |55
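
The `begin` and `end` values in the table appear to be inclusive character offsets within each input text: "Toronto" spans 155 to 161, which is seven characters, and the offsets restart for each new sample. Under that assumption, a chunk can be recovered from its source text with a slice that extends one past `end`. The `chunk_text` helper below is hypothetical and uses a single line of the first sample rather than the full text.

```python
def chunk_text(text, begin, end):
    """Slice a chunk out of `text` using inclusive begin/end offsets."""
    return text[begin:end + 1]

line = "She is a 48-year-old female admitted to Unity Health Institute in Toronto"

# Compute inclusive offsets the same way the table reports them.
begin = line.find("Toronto")
end = begin + len("Toronto") - 1

recovered = chunk_text(line, begin, end)
print(begin, end, recovered)
```

A check like this is a quick way to confirm that offsets still line up after any preprocessing of the input text.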

# Print Results for Assertions

In [None]:
# Assertions (only for clinical entities: Drug, Biomarker, Cancer_dx)
print("Assertions")
result.selectExpr("explode(clinical_assertions)").select("col.metadata.ner_chunk", "col.begin", "col.end", "col.result").show(100, truncate=False)

Assertions
+-----------------------+-----+----+----------------------------+
|ner_chunk              |begin|end |result                      |
+-----------------------+-----+----+----------------------------+
|ovarian carcinoma      |175  |191 |Past                        |
|mucinous-type carcinoma|194  |216 |Past                        |
|cyclophosphamide       |324  |339 |Past                        |
|carboplatin            |352  |362 |Past                        |
|Taxotere               |595  |602 |Past                        |
|Epirubicin             |616  |625 |Past                        |
|Cyclophosphamide       |643  |658 |Past                        |
|metaplastic carcinoma  |935  |955 |conditional                 |
|cytokeratin AE1/AE3    |1116 |1134|Past                        |
|myoepithelial markers  |1142 |1162|Past                        |
|CK 5/6                 |1175 |1180|Past                        |
|p63                    |1183 |1185|Past                        |
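
Each chunk above carries a single status even though five assertion models run over it. One plausible reading of the precedence list passed to the merger is that, when several models label the same chunk, the label from the earliest-listed source wins. The sketch below illustrates that reading only; `pick_assertion` is a hypothetical helper, not the `AssertionMerger` implementation.

```python
# Source order copied from the pipeline's precedence setting.
PRECEDENCE = [
    "assertion_conditional",
    "assertion_possible",
    "assertion_someone_else",
    "assertion_absent",
    "assertion_past",
]

def pick_assertion(candidates, precedence=PRECEDENCE):
    """From (source, label) pairs for one chunk, return the pair whose source ranks first."""
    rank = {source: position for position, source in enumerate(precedence)}
    return min(candidates, key=lambda candidate: rank[candidate[0]])

# Hypothetical case: two models label the same chunk.
candidates = [("assertion_past", "Past"), ("assertion_conditional", "conditional")]
winner = pick_assertion(candidates)
print(winner)
```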

# Entity and Assertion Visualization

In [None]:
# Collect once so both visualizations reuse the same rows
rows = result.collect()

# Visualize NER entities (without assertions) for the first sample text
print("Ner Result Entities ")
nlp.viz.NerVisualizer().display(rows[0], 'ner_chunk')

# Visualize clinical entities with their assertions for the sixth sample text
print("\n\n Clinical Entities")
nlp.viz.AssertionVisualizer().display(rows[5], 'clinical_entities', 'clinical_assertions')


Ner Result Entities 




 Clinical Entities
