![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Rule Based NER and Assertion Models
This notebook demonstrates rule-based Named Entity Recognition (NER) combined with assertion detection models for structured information extraction from clinical text.
It walks through a series of Text Matcher models, each designed to identify and extract one type of entity, such as countries, states, cities, drugs, biomarkers, and cancer diagnoses, using predefined dictionaries and linguistic patterns. Because each matcher targets a fixed vocabulary, it can achieve high precision, especially in specialized contexts where standard models may underperform.
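
To make the idea of dictionary-based matching concrete, here is a minimal pure-Python sketch. It is not how the pretrained Text Matcher models are implemented; the `DICTIONARIES` content and the `match_entities` helper are hypothetical and exist only for illustration. End offsets are inclusive, which matches the offsets printed later in this notebook.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-dictionaries; the pretrained matchers ship their own, much larger ones.
DICTIONARIES: Dict[str, List[str]] = {
    "City": ["Toronto", "Los Angeles"],
    "Drug": ["ibuprofen", "clopidogrel"],
}

def match_entities(text: str) -> List[Tuple[str, int, int, str]]:
    """Return (surface form, begin, end, label) tuples with inclusive end offsets."""
    hits = []
    for label, terms in DICTIONARIES.items():
        for term in terms:
            # Whole-word, case-insensitive lookup of each dictionary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.group(0), m.start(), m.end() - 1, label))
    # Sort by position so the output reads in document order.
    return sorted(hits, key=lambda h: h[1])

matches = match_entities("She takes clopidogrel and visited Los Angeles.")
for hit in matches:
    print(hit)
```

Because only listed terms can match, precision is bounded by the quality of the dictionary, while recall depends on how complete it is.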

Beyond entity detection, the notebook also integrates Contextual Assertion Models, which determine the status of an entity in context: for example, whether a drug is mentioned as being possibly used (Detect Possible Assertion) or conditionally prescribed (Detect Conditional Assertion).
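
The following sketch illustrates the general idea behind rule-based assertion detection: look for trigger phrases in the text surrounding an entity. It is a simplified illustration and not the pretrained models' rule set; the `TRIGGERS` table, the `assert_status` helper, and the 40-character window are all assumptions, and this version only inspects the text to the left of the entity.

```python
from typing import Optional

# Hypothetical trigger phrases; the pretrained models define their own rules.
TRIGGERS = {
    "absent": ["denies", "declined"],
    "possible": ["suggestive of", "possible"],
    "past": ["history of"],
}

def assert_status(sentence: str, entity: str, window: int = 40) -> Optional[str]:
    """Return the first status whose trigger appears within `window` characters before the entity."""
    lowered = sentence.lower()
    pos = lowered.find(entity.lower())
    if pos < 0:
        return None  # entity not found in this sentence
    left_context = lowered[max(0, pos - window):pos]
    # Statuses are checked in dictionary order, so "absent" wins over "past" here.
    for status, phrases in TRIGGERS.items():
        if any(phrase in left_context for phrase in phrases):
            return status
    return None  # no trigger found: no assertion is emitted

print(assert_status("History of ibuprofen use for muscle pain.", "ibuprofen"))
print(assert_status("Denies remote history of other types of cancer.", "cancer"))
print(assert_status("She takes clopidogrel for her cardiovascular risk.", "clopidogrel"))
```

A sentence such as "zofran declined." places the trigger after the entity, so a fuller implementation would also need to look at the right-hand context.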


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library
!pip install -q johnsnowlabs

In [None]:
# Upload license key for healthcare NLP
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
# Import Spark NLP and Spark NLP for Healthcare from the johnsnowlabs library
from johnsnowlabs import nlp, medical

nlp.install()

In [None]:
# import required modules
from sparknlp.base import *
from pyspark.ml import Pipeline

spark = nlp.start()

# 🔎 MODELS
Models used in this pipeline and the entities or assertion statuses they produce.

| Index | Model | Entities |
|---:|:------------------------|:-|
|  1 | [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) | Country |
|  2 | [state_matcher](https://nlp.johnsnowlabs.com/2024/09/11/state_matcher_en.html) | State |
|  3 | [city_matcher](https://nlp.johnsnowlabs.com/2024/07/02/city_matcher_en.html) | City |
|  4 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | Drug |
|  5 | [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html) | Biomarker |
|  6 | [cancer_diagnosis_matcher](https://nlp.johnsnowlabs.com/2024/06/17/cancer_diagnosis_matcher_en.html) | Cancer_dx |
|  7 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | Conditional |
|  8 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | Possible |
|  9 | [contextual_assertion_someone_else](https://nlp.johnsnowlabs.com/2024/06/26/contextual_assertion_someone_else_en.html) | Someone_else |
| 10 | [contextual_assertion_absent](https://nlp.johnsnowlabs.com/2024/07/03/contextual_assertion_absent_en.html) | Absent |
| 11 | [contextual_assertion_past](https://nlp.johnsnowlabs.com/2024/07/04/contextual_assertion_past_en.html) | Past |

# Rule-based Pipeline with Separated Entity Processing
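This pipeline combines the rule-based NER matchers with the contextual assertion models. All matched entities are merged into `ner_chunk`, while only the clinical entities (drug, biomarker, and cancer diagnosis) are merged into `clinical_entities` and passed to the assertion models.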

In [6]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

country_matcher = medical.TextMatcherModel.pretrained("country_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("country")\
    .setMergeOverlapping(True)

state_matcher = medical.TextMatcherModel.pretrained("state_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("state")\
    .setMergeOverlapping(True)

city_matcher = medical.TextMatcherModel.pretrained("city_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("city")\
    .setMergeOverlapping(True)

drug_matcher = medical.TextMatcherModel.pretrained("drug_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("drug")

biomarker_matcher = medical.TextMatcherModel.pretrained("biomarker_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("biomarker")

cancer_diagnosis_matcher = medical.TextMatcherModel.pretrained("cancer_diagnosis_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("cancer_dx")\
    .setMergeOverlapping(True)

# Merge all NER entities
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols(["drug", "biomarker", "cancer_dx","country", "state", "city"])\
    .setOutputCol("ner_chunk")\
    .setSelectionStrategy("Sequential")

# Merge clinical entities (for assertions)
clinical_merger = medical.ChunkMergeApproach()\
    .setInputCols(["drug", "biomarker", "cancer_dx"])\
    .setOutputCol("clinical_entities")\
    .setSelectionStrategy("DiverseLonger")\
    .setOrderingFeatures(["ChunkLength"])

# Assertion models (only for clinical entities)
contextual_assertion_conditional = medical.ContextualAssertion.pretrained("contextual_assertion_conditional", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_conditional")

contextual_assertion_possible = medical.ContextualAssertion.pretrained("contextual_assertion_possible", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_possible")

contextual_assertion_someone_else = medical.ContextualAssertion.pretrained("contextual_assertion_someone_else", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_someone_else")

contextual_assertion_absent = medical.ContextualAssertion.pretrained("contextual_assertion_absent", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_absent")

contextual_assertion_past = medical.ContextualAssertion.pretrained("contextual_assertion_past", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "clinical_entities"])\
    .setOutputCol("assertion_past")

assertion_merger = medical.AssertionMerger()\
    .setInputCols(["assertion_conditional", "assertion_possible", "assertion_someone_else", "assertion_absent", "assertion_past"])\
    .setOutputCol("clinical_assertions")\
    .setMergeOverlapping(True)\
    .setSelectionStrategy("sequential")\
    .setAssertionSourcePrecedence("assertion_conditional, assertion_possible, assertion_someone_else, assertion_absent, assertion_past")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    country_matcher,
    state_matcher,
    city_matcher,
    drug_matcher,
    biomarker_matcher,
    cancer_diagnosis_matcher,
    chunk_merger,
    clinical_merger,
    contextual_assertion_conditional,
    contextual_assertion_possible,
    contextual_assertion_someone_else,
    contextual_assertion_absent,
    contextual_assertion_past,
    assertion_merger
])


sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
country_matcher download started this may take some time.
Approximate size to download 10.2 KB
[OK!]
state_matcher download started this may take some time.
Approximate size to download 6.1 KB
[OK!]
city_matcher download started this may take some time.
Approximate size to download 180.3 KB
[OK!]
drug_matcher download started this may take some time.
Approximate size to download 5.5 MB
[OK!]
biomarker_matcher download started this may take some time.
Approximate size to download 25.6 KB
[OK!]
cancer_diagnosis_matcher download started this may take some time.
Approximate size to download 42.8 KB
[OK!]
contextual_assertion_conditional download started this may take some time.
Approximate size to download 1.3 KB
[OK!]
contextual_assertion_possible download started this may take some time.
Approximate size to download 1.7 KB
[OK!]
contextual_assertion_someone_else download 
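
The pipeline above passes an ordered list of assertion columns to `setAssertionSourcePrecedence`. One plausible reading is that, when several assertion models label the same span, the label from the earliest source in that list is kept. The sketch below illustrates only that reading in plain Python; it is not the `AssertionMerger` implementation, and the `pick_by_precedence` helper and the candidate rows are hypothetical.

```python
from typing import Dict, List, Tuple

# Same ordering as the string passed to setAssertionSourcePrecedence above.
PRECEDENCE = [
    "assertion_conditional",
    "assertion_possible",
    "assertion_someone_else",
    "assertion_absent",
    "assertion_past",
]

Candidate = Tuple[int, int, str, str]  # (begin, end, source column, status)

def pick_by_precedence(candidates: List[Candidate]) -> Dict[Tuple[int, int], str]:
    """Keep one status per span, preferring sources that come earlier in PRECEDENCE."""
    rank = {source: i for i, source in enumerate(PRECEDENCE)}
    best: Dict[Tuple[int, int], Tuple[str, str]] = {}
    for begin, end, source, status in candidates:
        span = (begin, end)
        if span not in best or rank[source] < rank[best[span][0]]:
            best[span] = (source, status)
    return {span: status for span, (_, status) in best.items()}

# Hypothetical candidates: two models fire on the same span, one on another span.
candidates = [
    (438, 446, "assertion_past", "past"),
    (438, 446, "assertion_conditional", "conditional"),
    (383, 388, "assertion_absent", "absent"),
]
merged = pick_by_precedence(candidates)
print(merged)
```

With this ordering, a span flagged as both conditional and past would be reported as conditional.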

# Fit the Pipeline

In [7]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
fitted_pipeline = pipeline.fit(empty_data)

# Sample Clinical Notes to Try Out the Pipeline

In [81]:
sample_texts = [
    """
Name: Patel, Rina  Record Date: 1996-11-03  MR: 781093

Dr. Sofia Chen

Presentation Summary
A 48-year-old female admitted to Unity Health Institute in Toronto
for thyroidectomy.

PMH
She takes clopidogrel for her cardiovascular risk.
History of ibuprofen use for muscle pain and history of azithromycin for a sinus infection two months ago.
Denies remote history of other types of cancer.
She experiences mild gastric upset when taking ibuprofen, particularly if ingested without food.

HPI
Ms. Patel is a 48-year-old woman with a diagnosis of papillary thyroid carcinoma.
Ultrasound showed a 2.6 cm TI-RADS 5 nodule; FNA confirmed malignancy. Pre-op labs revealed suppressed TSH (0.14 mIU/L), elevated thyroglobulin (135 ng/mL), and BRAF V600E mutation positivity.
She underwent hemithyroidectomy with central neck dissection on 11/03/95. Pathology showed multifocal disease (largest 2.8 cm), capsular invasion, and 3/12 positive lymph nodes.
Serum thyroglobulin are suggestive of rising during periods of noncompliance with levothyroxine therapy.

Family History
Sister and brother both have asthma. Grandfather had lung cancer in his late 70s.

Social history
No smoking, alcohol or drug use history.
In the past 18 months, the patient has traveled to India, Germany and Brazil for both business and leisure.
Two months ago she visited her sister in Los Angeles, California.
She denies any symptoms during or after her trips.

Review of Systems
General: Denies fever, chills, unintended weight loss, or night sweats.
Patient resting in bed. Patient given azithromycin without any difficulty.
Patient denies nausea at this time. zofran declined.
Cardiovascular: Reports intermittent chest pain on exertion, worse with stress and heavy meals. Denies palpitations, syncope, or orthopnea.
Respiratory: Denies cough, hemoptysis, or shortness of breath at rest.
""",

"""
Name: Johnson, Maria  Record Date: 2000-04-12  MR: 894562

Dr. Daniel Romero

Presentation Summary
A 55-year-old female admitted to St. Mary’s Medical Center in Chicago for evaluation and surgical management of colorectal carcinoma.

PMH
She is currently on atorvastatin for hyperlipidemia.
History of amoxicillin use for recurrent sinus infections and history of naproxen for joint pain one year ago.
Denies prior history of breast or thyroid cancer.
She experiences mild dizziness when taking amoxicillin, particularly if combined with alcohol.

HPI
Ms. Johnson is a 55-year-old woman with a recent diagnosis of colorectal adenocarcinoma. Colonoscopy revealed a 4.1 cm mass in the colon; biopsy confirmed malignancy. Pre-op biomarkers showed:
CEA: 18.2 ng/mL (elevated)
KRAS mutation: Positive
TSH: 2.1 mIU/L (within normal range)
CT scan of the abdomen demonstrated possible lymph node involvement. Findings were suggestive of early hepatic metastasis, though not definitive.
She underwent subtotal colectomy with lymph node dissection on 04/10/96. Pathology shows moderately differentiated adenocarcinoma with 2/15 positive lymph nodes.

Family History
Sister diagnosed with type 2 diabetes.
Brother has hypertension.
Grandfather died of prostate cancer at age 82.
Other family members reported history of cardiovascular disease.

Social History
No smoking or recreational drug use. Occasional alcohol.
In the past 2 years, the patient has traveled to Spain, Mexico, and South Korea for work.
Three months ago she stayed with her brother in Miami, Florida.
She denies any gastrointestinal symptoms during or after her trips.

Review of Systems
General: Denies fever, chills, or unintended weight loss. Patient resting comfortably.
GI: Reports intermittent abdominal pain, worse after large meals. Patient denies current nausea. Ondansetron declined when offered postoperatively.
Cardiovascular: Reports intermittent palpitations, particularly if under stress. Denies syncope or chest pressure.
Respiratory: Denies cough, hemoptysis, or shortness of breath.
Medications/Drug exposures: Currently on atorvastatin. History of amoxicillin and naproxen use. Denies anticoagulant, opioid, or benzodiazepine use.
""",

"""
Patient ID: MR-552341
Name: Alvarez, David  Date: 2015-07-22
Consulting Physician: Dr. Helen Matsuda

Initial Encounter
A 62-year-old male presented to Mercy Medical Center in Baltimore for evaluation of a persistent cough and unintentional weight loss.

Medical Background
Current therapy: Metoprolol for hypertension.
History of prednisone use for bronchitis and history of doxycycline for pneumonia five years ago.
Denies history of prior malignancy.
Reports dizziness with metoprolol, particularly if taken on an empty stomach.

Clinical Course
Mr. Alvarez underwent CT chest, which revealed a 3.9 cm spiculated lesion in the right upper lobe. PET scan demonstrated increased uptake in hilar nodes, suggestive of lung cancer metastatic disease.
Biopsy confirmed non–small cell lung carcinoma (NSCLC).
Molecular markers included:
EGFR mutation: Negative
ALK rearrangement: Positive
CEA: 12.5 ng/mL (elevated)
The patient was counseled regarding targeted therapy options. He declined enrollment in a clinical trial but is considering ALK-inhibitor therapy.

Family & Genetic History
Sister has rheumatoid arthritis.
Brother has COPD.
Grandfather died of gastric cancer at age 79.
Other family members with cardiovascular disease and stroke.

Lifestyle & Exposures
Denies tobacco or recreational drug use, but reports a 15-year past history of smoking, quit 20 years ago.
Occasional alcohol intake.
Recent travel to Japan, Canada, and Argentina for conferences.
One month ago he went to a concert in Houston, Texas.
He denies any respiratory symptoms during or after his travels.

System Review
General: Fatigue and 5 kg weight loss over 3 months. Denies fever or chills.
Respiratory: Chronic cough with streaks of blood; denies wheezing at rest.
Cardiac: Reports palpitations, particularly if walking uphill; denies syncope.
GI: No abdominal pain, but appetite loss. Patient denies nausea; ondansetron declined when offered in ED.
Neurological: Denies headaches or seizures.
"""
]


# Apply the Pipeline to Sample Texts

In [82]:
data = spark.createDataFrame([[text] for text in sample_texts]).toDF("text")
result = fitted_pipeline.transform(data)

# Print the Results for NER

In [83]:
# Print results for all NER entities
print("NER entities")
result.selectExpr("explode(ner_chunk)").select("col.result", "col.begin", "col.end", "col.metadata.entity").show(100, truncate=False)

NER entities
+---------------------------+-----+----+---------+
|result                     |begin|end |entity   |
+---------------------------+-----+----+---------+
|Toronto                    |153  |159 |City     |
|clopidogrel                |195  |205 |DRUG     |
|ibuprofen                  |247  |255 |DRUG     |
|azithromycin               |292  |303 |DRUG     |
|cancer                     |383  |388 |Cancer_dx|
|ibuprofen                  |438  |446 |DRUG     |
|papillary thyroid carcinoma|546  |572 |Cancer_dx|
|malignancy                 |634  |643 |Cancer_dx|
|TSH                        |678  |680 |Biomarker|
|thyroglobulin              |705  |717 |Biomarker|
|BRAF                       |736  |739 |Biomarker|
|Serum thyroglobulin        |946  |964 |Biomarker|
|levothyroxine              |1028 |1040|DRUG     |
|lung cancer                |1120 |1130|Cancer_dx|
|India                      |1257 |1261|COUNTRY  |
|Germany                    |1264 |1270|COUNTRY  |
|Brazil           
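
The entity labels in the table above use mixed casing (`City`, `DRUG`, `COUNTRY`), so it can help to normalize them before grouping. The sketch below does this in plain Python on a few rows copied from the table; the `group_by_label` helper is hypothetical, and in practice the rows would presumably come from collecting the exploded `ner_chunk` column.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Rows copied from the table above: (result, begin, end, entity).
rows: List[Tuple[str, int, int, str]] = [
    ("Toronto", 153, 159, "City"),
    ("clopidogrel", 195, 205, "DRUG"),
    ("ibuprofen", 247, 255, "DRUG"),
    ("TSH", 678, 680, "Biomarker"),
    ("India", 1257, 1261, "COUNTRY"),
]

def group_by_label(rows: List[Tuple[str, int, int, str]]) -> Dict[str, List[str]]:
    """Group surface forms by lower-cased entity label, preserving document order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, _begin, _end, label in rows:
        grouped[label.lower()].append(text)
    return dict(grouped)

by_label = group_by_label(rows)
print(by_label)
```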

# Print Results for Assertions

In [85]:
# Assertions (only for clinical entities: Drug, Biomarker, Cancer_dx)
print("Assertions")
result.selectExpr("explode(clinical_assertions)").select("col.metadata.ner_chunk", "col.begin", "col.end", "col.result").show(100, truncate=False)

Assertions
+-------------------+-----+----+----------------------------+
|ner_chunk          |begin|end |result                      |
+-------------------+-----+----+----------------------------+
|ibuprofen          |247  |255 |Past                        |
|azithromycin       |292  |303 |Past                        |
|cancer             |383  |388 |absent                      |
|ibuprofen          |438  |446 |conditional                 |
|Serum thyroglobulin|946  |964 |possible                    |
|levothyroxine      |1028 |1040|possible                    |
|lung cancer        |1120 |1130|associated_with_someone_else|
|zofran             |1634 |1639|absent                      |
|amoxicillin        |303  |313 |Past                        |
|naproxen           |365  |372 |Past                        |
|thyroid cancer     |437  |450 |absent                      |
|amoxicillin        |496  |506 |conditional                 |
|prostate cancer    |1243 |1257|associated_with_someone_els
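
Only entities that triggered an assertion rule appear in the table above; `clopidogrel`, for instance, is extracted as a drug but has no assertion row. A simple way to get one status per entity is to join the two outputs on their offsets. The sketch below does this for rows from the first sample text; the `attach_status` helper and the `"present"` default are assumptions made for illustration, not values produced by the pipeline.

```python
from typing import Dict, List, Tuple

# Entity rows from the first sample text: (result, begin, end).
entities: List[Tuple[str, int, int]] = [
    ("clopidogrel", 195, 205),
    ("ibuprofen", 247, 255),
    ("azithromycin", 292, 303),
    ("cancer", 383, 388),
]

# Assertion rows for the same text, keyed by (begin, end).
assertions: Dict[Tuple[int, int], str] = {
    (247, 255): "Past",
    (292, 303): "Past",
    (383, 388): "absent",
}

def attach_status(entities, assertions, default="present"):
    """Pair each entity with its assertion, falling back to an assumed default label."""
    return [
        (text, assertions.get((begin, end), default))
        for text, begin, end in entities
    ]

labelled = attach_status(entities, assertions)
for text, status in labelled:
    print(f"{text}: {status}")
```

Offsets restart for every document (the table above jumps from 1634 back to 303 when the second note begins), so a join across several notes would also need a document identifier in the key.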

# Entity and Assertion Visualization

## Visualize NER entities (without assertions)

In [86]:
print("\nAll Ner Result Entities (without assertions)\n\n")
# first text sample
nlp.viz.NerVisualizer().display(result.collect()[0], 'ner_chunk')


All Ner Result Entities (without assertions)




## Visualize Clinical Entities with their Assertions

In [87]:
print("\nOnly NER Clinical Entities with their assertions\n\n")
# first text sample
nlp.viz.AssertionVisualizer().display(result.collect()[0], 'clinical_entities', 'clinical_assertions')


Only NER Clinical Entities with their assertions


