![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/40.1.Text_Matcher_Internal.ipynb)

# 📜 TextMatcherInternal

🔍 What is TextMatcherInternal?
TextMatcherInternal is an annotator in Spark NLP used to detect exact phrase matches within a document, based on a predefined list of entity phrases.

This component performs token-level exact matching between the input text and a set of phrases provided by the user, typically through an external text file. When a match is found, it creates a CHUNK annotation containing the matched text span.
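To make the idea of token-level exact matching concrete, here is a minimal pure-Python sketch. It is not the annotator's implementation: the helper names `tokenize` and `match_phrases` are hypothetical, the tokenizer is a simple regex, and the inclusive `end` offset is an assumption chosen to mirror the `CHUNK` offsets printed later in this notebook.

```python
import re

def tokenize(text):
    """Split text into (token, begin, end) triples; end is inclusive."""
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r"\w+", text)]

def match_phrases(text, phrases):
    """Return (chunk, begin, end) for every phrase whose tokens appear consecutively."""
    tokens = tokenize(text)
    matches = []
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            window = tokens[i:i + len(words)]
            # Exact comparison, token by token
            if [t[0] for t in window] == words:
                begin, end = window[0][1], window[-1][2]
                matches.append((text[begin:end + 1], begin, end))
    return matches

text = "prescribed aspirin 100mg and paracetamol"
print(match_phrases(text, ["aspirin 100mg", "paracetamol"]))
```

Each match carries the matched span and its character offsets, which is the information a `CHUNK` annotation exposes.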

✅ When to Use It
TextMatcherInternal is particularly useful in scenarios such as:

- Detecting predefined terms in clinical, legal, or domain-specific documents
- Keyword-based filtering or rule-based entity recognition
- Pre-labeling data for supervised training or rule-based NER systems


**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop Repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

Python Documentation: [TextMatcher](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/matcher/text_matcher/index.html#sparknlp.annotator.matcher.text_matcher.TextMatcher)

Scala Documentation: [TextMatcher](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/TextMatcher)



## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**

## TextMatcherInternal Parameters

| **Parameter** | **Description** |
| --- | --- |
| **entities** | ExternalResource for entities. A text file of predefined phrases must be provided. |
| **caseSensitive** | Whether matching is case-sensitive. Defaults to `True`. |
| **mergeOverlapping** | Whether to merge overlapping matched chunks. Defaults to `False`. |
| **entityValue** | Value for the entity metadata field. |
| **buildFromTokens** | Whether the matcher should take the CHUNK from TOKEN. |
| **dictionary** | External dictionary for lemmatizer. |
| **enableLemmatizer** | Whether to enable lemmatizer. Defaults to `False`. |
| **enableStemmer** | Whether to enable stemmer. Defaults to `False`. |
| **stopWords** | List of stop words to be removed. Defaults to `None`. |
| **cleanStopWords** | Whether to clean stop words. Defaults to `False`. |
| **shuffleEntitySubTokens** | Whether to generate and use variations (permutations) of the entity phrases. Defaults to `False`. |
| **safeKeywords** | Keywords to preserve during stopword removal when `cleanStopWords` is enabled. Defaults to empty. |
| **excludePunctuation** | If `True`, punctuation will be removed from the text. Defaults to `True`. |
| **cleanKeywords** | Additional keywords to be removed alongside default stopwords. Defaults to empty. |
| **excludeRegexPatterns** | Regex patterns used to drop matched chunks. Defaults to empty. |
| **returnChunks** | Controls whether to return the original text chunks from input or the matched (e.g., stemmed/lemmatized) phrases. Can be `'original'` or `'matched'`. Defaults to `'original'`. |
| **skipMatcherAugmentation** | Whether to skip matcher augmentation. Defaults to `False`. |
| **skipSourceTextAugmentation** | Whether to skip source text augmentation. Defaults to `False`. |

---

## TextMatcherInternalModel Parameters

| **Parameter** | **Description** |
| --- | --- |
| **mergeOverlapping** | Whether to merge overlapping matched chunks. Defaults to `False`. |
| **entityValue** | Value for the entity metadata field. |
| **buildFromTokens** | Whether the matcher should take the CHUNK from TOKEN. |
| **enableLemmatizer** | Whether to enable lemmatizer. Defaults to `False`. |
| **enableStemmer** | Whether to enable stemmer. Defaults to `False`. |
| **stopWords** | List of stop words to be removed. Defaults to `None`. |
| **cleanStopWords** | Whether to clean stop words. Defaults to `False`. |
| **returnChunks** | Whether to return original chunks or matched chunks. Defaults to original chunks. |
| **safeKeywords** | Keywords to preserve during stopword removal when `cleanStopWords` is enabled. Defaults to empty. |
| **excludePunctuation** | If `True`, punctuation will be removed from the text. Defaults to `True`. |
| **cleanKeywords** | Additional keywords to be removed alongside default stopwords. Defaults to empty. |
| **excludeRegexPatterns** | Regex patterns used to drop matched chunks. Defaults to empty. |


## **🎬 Colab Setup**

In [None]:
import json
import os
import random
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install --upgrade -q spark-nlp-display

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import *
from sparknlp.annotator import *

from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
import pyspark.sql.types as T
import pyspark.sql.functions as F

import functools
import numpy as np
import pandas as pd
from scipy import spatial

params = {
    "spark.driver.memory":"50G",
    "spark.driver.maxResultSize":"5G",
}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)
print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


# Basic Usage of `TextMatcherInternal`

## How to Use `TextMatcherInternal`

First, create a source file that lists every phrase (chunk or token) to capture, one per line. In the example below, each line pairs a phrase with its label, separated by `#`, so the annotator must be configured with `setDelimiter('#')`.

In [4]:
matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open ('matcher_drug.csv', 'w') as f:
  f.write(matcher_drug)
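As a rough illustration of the `phrase#label` layout, the sketch below splits such lines into a phrase-to-label mapping. The function `parse_entities` is hypothetical and only mimics the file format; the fallback label `"entity"` for lines without a delimiter is an assumption based on the label shown in later outputs of this notebook.

```python
def parse_entities(raw, delimiter="#"):
    """Map each phrase to its label; lines without the delimiter get a fallback label."""
    entities = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, such as the leading and trailing ones
        phrase, sep, label = line.partition(delimiter)
        # Assumed fallback label when no delimiter is present on the line
        entities[phrase.strip()] = label.strip() if sep else "entity"
    return entities

sample = """
Aspirin 100mg#Drug
fever#Symptom
GORD#Disease
"""
print(parse_entities(sample))
```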

In [5]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

matcher_pipeline = Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        entityExtractor
])

text = """
John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever,
amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = matcher_pipeline.fit(data).transform(data)

In [6]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.begin,
                                     result.matched_text.end,
                                     result.matched_text.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|aspirin 100mg|   26| 38| Drug|
|  amoxicillin|  103|113| Drug|
| lansoprazole|  171|182| Drug|
|  paracetamol|   76| 86| Drug|
|      aspirin|   26| 32| Drug|
|    ibuprofen|  135|143| Drug|
+-------------+-----+---+-----+



As shown above, the `matcher_drug.csv` file contains two overlapping entities, `aspirin` and `Aspirin 100mg`, and the text matches both. To keep both chunks in the output, set the `mergeOverlapping` parameter to `False`, as in the example below.

In [7]:
entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setDelimiter("#")\
    .setCaseSensitive(False)\
    .setMergeOverlapping(False)

matcher_pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        entityExtractor
])

result = matcher_pipeline.fit(data).transform(data)

In [8]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.begin,
                                     result.matched_text.end,
                                     result.matched_text.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|aspirin 100mg|   26| 38| Drug|
|  amoxicillin|  103|113| Drug|
| lansoprazole|  171|182| Drug|
|  paracetamol|   76| 86| Drug|
|      aspirin|   26| 32| Drug|
|    ibuprofen|  135|143| Drug|
+-------------+-----+---+-----+
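For contrast, the sketch below shows one plausible reading of what merging overlapping chunks could look like: among chunks that overlap, only the longest span is kept. This is an assumption for illustration only, not the annotator's documented algorithm, and `merge_overlapping` is a hypothetical helper. Offsets use an inclusive `end`, as in the table above.

```python
def merge_overlapping(chunks):
    """chunks: list of (text, begin, end) with inclusive end.
    Keep the longest chunk of each overlapping group (assumed strategy)."""
    kept = []
    # Visit longest chunks first so shorter overlapping ones are dropped
    for chunk in sorted(chunks, key=lambda c: c[2] - c[1], reverse=True):
        _, begin, end = chunk
        if all(end < k[1] or begin > k[2] for k in kept):
            kept.append(chunk)
    return sorted(kept, key=lambda c: c[1])

chunks = [("aspirin 100mg", 26, 38), ("aspirin", 26, 32), ("paracetamol", 76, 86)]
print(merge_overlapping(chunks))
```

Under this assumed strategy, `aspirin` is absorbed by `aspirin 100mg` because both start at offset 26.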



Setting `caseSensitive` to `True` makes the matcher compare phrases against the text with their exact casing. Phrases whose casing differs from the text are then skipped: below, `Aspirin 100mg` from the source file no longer matches `aspirin 100mg` in the text, while lowercase `aspirin` still does.

In [9]:
entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setDelimiter("#")\
    .setCaseSensitive(True)\
    .setMergeOverlapping(False)

matcher_pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        entityExtractor
])

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

In [10]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.begin,
                                     result.matched_text.end,
                                     result.matched_text.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+------------+-----+---+-----+
|       chunk|begin|end|label|
+------------+-----+---+-----+
| amoxicillin|  103|113| Drug|
|lansoprazole|  171|182| Drug|
| paracetamol|   76| 86| Drug|
|     aspirin|   26| 32| Drug|
|   ibuprofen|  135|143| Drug|
+------------+-----+---+-----+
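The effect of case sensitivity can be sketched in plain Python. The helper `find_phrase` below is hypothetical and uses simple substring search rather than token-level matching, so it only illustrates the comparison rule, not how the annotator works internally.

```python
def find_phrase(text, phrase, case_sensitive=True):
    """Return inclusive (begin, end) offsets of the first occurrence of phrase, or None."""
    if case_sensitive:
        haystack, needle = text, phrase
    else:
        # Lowercase both sides so casing differences are ignored
        haystack, needle = text.lower(), phrase.lower()
    begin = haystack.find(needle)
    return None if begin == -1 else (begin, begin + len(needle) - 1)

text = "John's doctor prescribed aspirin 100mg for his heart condition"
print(find_phrase(text, "Aspirin 100mg", case_sensitive=True))
print(find_phrase(text, "Aspirin 100mg", case_sensitive=False))
```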



## Multiple Entities

In [11]:
multiple_entities = """
Aspirin 100mg#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
fever#Symptom
headache#Symptom
tonsilitis#Disease
GORD#Disease
heart condition#Disease
"""

with open ('multiple_entities.csv', 'w') as f:
  f.write(multiple_entities)

In [12]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

entityExtractor = TextMatcherInternal() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setDelimiter("#")\
    .setCaseSensitive(False)

matcher_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

text = """
John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache,
amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.
"""
data = spark.createDataFrame([[text]]).toDF("text")

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

In [13]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.begin,
                                     result.matched_text.end,
                                     result.matched_text.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|  aspirin 100mg|   26| 38|   Drug|
|    amoxicillin|  116|126|   Drug|
|   lansoprazole|  184|195|   Drug|
|    paracetamol|   76| 86|   Drug|
|      ibuprofen|  148|156|   Drug|
|          fever|   96|100|Symptom|
|       headache|  106|113|Symptom|
|heart condition|   48| 62|Disease|
|     tonsilitis|  136|145|Disease|
|           GORD|  205|208|Disease|
+---------------+-----+---+-------+



## `TextMatcherInternalModel`




In [14]:
entityExtractor = TextMatcherInternal() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

matcher_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

text = """John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache,
amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."""

data = spark.createDataFrame([[text]]).toDF("text")

# Keep the fitted PipelineModel so its last stage (a TextMatcherInternalModel) can be saved below
matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

Saving the fitted model to disk

In [15]:
matcher_model.stages[-1].write().overwrite().save("matcher_model")

Loading the saved model with `TextMatcherInternalModel.load()` and using it in a new pipeline.

In [16]:
entity_ruler = TextMatcherInternalModel.load('./matcher_model') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("matched_text")

pipeline = Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        entity_ruler
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_data)

Checking the result

In [17]:
result = pipeline_model.transform(data)

result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                     result.matched_text.begin,
                                     result.matched_text.end,
                                     result.matched_text.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|heart condition|   47| 61|Disease|
|     tonsilitis|  135|144|Disease|
|           GORD|  204|207|Disease|
|  aspirin 100mg|   25| 37|   Drug|
|    amoxicillin|  115|125|   Drug|
|   lansoprazole|  183|194|   Drug|
|    paracetamol|   75| 85|   Drug|
|      ibuprofen|  147|155|   Drug|
|          fever|   95| 99|Symptom|
|       headache|  105|112|Symptom|
+---------------+-----+---+-------+



# Advanced Usage of `TextMatcherInternal`

In this section, we demonstrate advanced features of the `TextMatcherInternal` annotator that enhance its flexibility and robustness in text matching tasks. These features include lemmatization and stemming for matching different word forms, customizable stop word removal with safe keywords preservation, punctuation exclusion, pattern-based chunk exclusion via regex, and the ability to control whether to return original or transformed matched text. We also show how to generate phrase permutations for more comprehensive matching, and options to skip automatic augmentation steps.

Consider the following example sentence for illustration:

> "Patient was able to talk briefly about recent life stressors during evaluation of psychiatric state.
She reports difficulty sleeping and ongoing anxiety. Denies suicidal ideation."


## Basic Pipeline Components

In [18]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


## Example Text

In [19]:
text = """
Patient was able to talk briefly about recent life stressors during evaluation of psychiatric state.
She reports difficulty sleeping and ongoing anxiety. Denies suicidal ideation.
"""

empty_data = spark.createDataFrame([[""]]).toDF("text")
data = spark.createDataFrame([[text]]).toDF("text")

### 🌱 Demonstrating Lemmatizer and Stemmer Features

To showcase the effects of lemmatization and stemming on text matching, we use a set of example phrases that include different word forms and variations.  

For instance, a word like **"sleeping"** is matched to its base form **"sleep"** when lemmatization is enabled.

Here is a sample list of phrases saved to a file (`test-phrases.txt`) that we will use for matching against input text to demonstrate these features:



In [20]:
test_phrases = """
stressor
difficulty sleep
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [21]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True) \
    .setBuildFromTokens(True)\
    .setReturnChunks("original")\
    .setExcludePunctuation(True)

text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [22]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
         })

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+-------------------+----------------+
|entity|begin|end|result             |matched         |
+------+-----+---+-------------------+----------------+
|entity|52   |60 |stressors          |stressor        |
|entity|114  |132|difficulty sleeping|difficulty sleep|
+------+-----+---+-------------------+----------------+



### 🔁 shuffleEntitySubTokens


The **`shuffleEntitySubTokens`** parameter controls whether token-level permutations of entity phrases should be generated and used during matching.

When set to `True`, the matcher will automatically generate all possible orderings (permutations) of the tokens within each entity phrase and try to match them against the input text.

For example, if `"sleep difficulty"` is in your entity list:
- With `shuffleEntitySubTokens=True`, phrases like `"difficulty sleep"` will also match.
- With `shuffleEntitySubTokens=False`, only the exact phrase `"sleep difficulty"` will match.

> ⚠️ **Note:** This parameter is only supported in the `TextMatcherInternal` class (the training/approach component). It is **not available** in the `TextMatcherInternalModel` class (the trained model component).
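The permutation idea can be sketched with the standard library. This is only an illustration of generating token orderings; `shuffle_sub_tokens` is a hypothetical helper and does not reproduce how the annotator stores or matches the variants.

```python
from itertools import permutations

def shuffle_sub_tokens(phrase):
    """Return every ordering of the whitespace-separated tokens in phrase."""
    tokens = phrase.split()
    # A set removes duplicates when a phrase repeats a token
    return sorted({" ".join(p) for p in permutations(tokens)})

print(shuffle_sub_tokens("sleep difficulty"))
print(len(shuffle_sub_tokens("evaluation psychiatric state")))
```

The number of orderings grows factorially with phrase length (2 for two tokens, 6 for three, 24 for four), which is worth keeping in mind for long entity phrases.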

In [23]:
test_phrases = """
suicidal deny
sleep difficulty
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [24]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True) \
    .setShuffleEntitySubTokens(True)\
    .setBuildFromTokens(True)\
    .setReturnChunks("original")

text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [25]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+-------------------+----------------+
|entity|begin|end|result             |matched         |
+------+-----+---+-------------------+----------------+
|entity|114  |132|difficulty sleeping|difficulty sleep|
|entity|155  |169|Denies suicidal    |deni suicidal   |
+------+-----+---+-------------------+----------------+



### 🛑 Demonstrating Stop Word Removal

In this section, we demonstrate how removing stop words from both the **input text** and the **entity phrases** can make phrase matching more flexible and less sensitive to filler words.

Stop words are common words such as **"and"**, **"the"**, or **"about"** that typically carry limited semantic meaning. These words can interfere with matching by creating unnecessary mismatches, especially when your entity phrases are clean and concise.

#### ⚙️ How it works:
- By enabling `.setCleanStopWords(True)`, the matcher will activate stop word filtering.
- If you do not specify a custom list using `.setStopWords([...])`, it defaults to **Spark ML’s built-in English stop word list**.
- These stop words will be removed from **both the source text** and the **entity phrases** before matching.
- To preserve specific terms that appear in the stop word list, you can use `.setSafeKeywords([...])`.
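The steps above can be sketched in plain Python. The stop word set here is a tiny illustrative list, not Spark ML's built-in one, and `remove_stop_words` is a hypothetical helper rather than the annotator's code.

```python
# Tiny illustrative list; the real default list is much longer
STOP_WORDS = {"of", "the", "and", "about", "to", "was"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, safe_keywords=()):
    """Drop stop words unless they are listed as safe keywords."""
    safe = {w.lower() for w in safe_keywords}
    return [t for t in tokens if t.lower() not in stop_words or t.lower() in safe]

source = "evaluation of psychiatric state".split()
entity = "evaluation psychiatric state".split()

# After cleaning, the source span and the entity phrase become identical
print(remove_stop_words(source) == remove_stop_words(entity))
# A safe keyword survives even though it is in the stop word list
print(remove_stop_words(source, safe_keywords=["of"]))
```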




In [26]:
test_phrases = """
evaluation psychiatric state
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [27]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True)

text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [28]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+-------------------------------+----------------------------+
|entity|begin|end|result                         |matched                     |
+------+-----+---+-------------------------------+----------------------------+
|entity|69   |99 |evaluation of psychiatric state|evaluation psychiatric state|
+------+-----+---+-------------------------------+----------------------------+



###  🧹 Customizing Text Cleaning with Stop Words, Safe Keywords, and Punctuation Settings


The `TextMatcherInternal` annotator provides flexible options to control how the input text is cleaned before matching. This allows users to fine-tune what content should be removed or preserved for optimal matching accuracy. The following parameters are especially important:

- **`stopWords`**: Defines a custom list of stop words (set with `setStopWords`) to be removed from both the input text and the entity phrases. These words are excluded before any matching is performed.

- **`cleanKeywords`**: Specifies additional words (beyond the stop words) that should also be removed. Useful for removing domain-specific noise words.

- **`safeKeywords`**: Lists important keywords that should **not** be removed even if they appear in the stop word list. This ensures that key terms are preserved during cleaning.

- **`excludePunctuation`**: When set to `True`, all punctuation characters (such as `.`, `,`, `!`, etc.) are removed from the text before matching.

These options provide granular control over the preprocessing phase, allowing you to reduce noise while preserving meaningful content. This is particularly helpful when dealing with informal, noisy, or user-generated text.
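A rough sketch of how these four options might combine is shown below. The function `clean_tokens` is hypothetical, and two choices in it are assumptions: safe keywords only protect entries of the stop word list, and punctuation is stripped only from token edges.

```python
import string

def clean_tokens(tokens, stop_words=(), clean_keywords=(), safe_keywords=(),
                 exclude_punctuation=True):
    """Apply stop word, keyword, and punctuation cleaning to a token list."""
    safe = {w.lower() for w in safe_keywords}
    # Assumption: safe keywords shield stop words, not clean keywords
    removable = ({w.lower() for w in stop_words} - safe) | {w.lower() for w in clean_keywords}
    cleaned = []
    for token in tokens:
        if exclude_punctuation:
            # Assumption: only leading and trailing punctuation is stripped
            token = token.strip(string.punctuation)
        if token and token.lower() not in removable:
            cleaned.append(token)
    return cleaned

tokens = "during evaluation of psychiatric state .".split()
print(clean_tokens(tokens, stop_words=["and", "in"], clean_keywords=["of"]))
print(clean_tokens(tokens, stop_words=["and", "in", "of"], safe_keywords=["of"]))
```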

In [29]:
test_phrases = """
evaluation psychiatric state
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [30]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setStopWords(["and", "in"]) \
    .setCleanStopWords(True) \
    .setCleanKeywords(["of"]) \
    .setExcludePunctuation(True)


text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
    ])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [31]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+-------------------------------+----------------------------+
|entity|begin|end|result                         |matched                     |
+------+-----+---+-------------------------------+----------------------------+
|entity|69   |99 |evaluation of psychiatric state|evaluation psychiatric state|
+------+-----+---+-------------------------------+----------------------------+



### 🔧 Matcher & Source Augmentation Settings


The parameters **`skipMatcherAugmentation`** and **`skipSourceTextAugmentation`** control whether automatic variations of the phrases (such as token permutations or transformations) are considered during matching. These augmentations are useful to improve match coverage but can be turned off for stricter or faster matching.

- **`skipMatcherAugmentation`**  
  When set to `True`, disables augmentation of the matcher entities.  
  For example, if the phrase is `"study biology"`, by default, variations like `"biology study"` might also be generated and matched — unless this is skipped.

- **`skipSourceTextAugmentation`**  
  When set to `True`, prevents augmentation (such as reordering tokens or normalizing text) of the **input** source text before matching.  
  Useful when you want to match the input exactly as it is written, without any alterations.

These parameters give you more control over how flexible or strict the matching process should be.


In [32]:
test_phrases = """
stressor
evaluation psychiatric state
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [33]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True) \
    .setBuildFromTokens(True)\
    .setReturnChunks("original")\
    .setSkipMatcherAugmentation(True)\
    .setSkipSourceTextAugmentation(False)


text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
    ])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [34]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+----------------------------+----------------------------+
|entity|begin|end|result                      |matched                     |
+------+-----+---+----------------------------+----------------------------+
|entity|69   |99 |evaluation psychiatric state|evaluation psychiatric state|
|entity|52   |60 |stressors                   |stressor                    |
+------+-----+---+----------------------------+----------------------------+



### 🔄 `returnChunks`

The **`returnChunks`** parameter controls the format of the matched phrases returned by the annotator.

You can set it to either:

- `"original"` – Returns the chunk exactly as it appears in the input text.
- `"matched"` – Returns the normalized version of the matched phrase (e.g., after stemming or lemmatization).

This setting is especially useful when using normalization techniques like stemming or lemmatization and you want to analyze which version of the entity was responsible for the match.

#### 📌 Important Note:
Even when `returnChunks` is set to `"matched"`, the **`begin`** and **`end`** indices in the resulting `CHUNK` annotation still refer to the **original text**.
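The relationship between the returned text and the offsets can be sketched as follows. Everything here is illustrative: `toy_stem` is a deliberately naive, hypothetical normalizer (not a real stemmer or lemmatizer), and `match` is a hypothetical helper, not the annotator's logic.

```python
import re

def toy_stem(word):
    """Hypothetical normalizer: lowercase and strip a trailing 'ing' or 's'."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def match(text, phrase, return_chunks="original"):
    """Match phrase on normalized tokens; offsets always refer to the original text."""
    tokens = [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r"\w+", text)]
    target = [toy_stem(w) for w in phrase.split()]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if [toy_stem(t[0]) for t in window] == target:
            begin, end = window[0][1], window[-1][2]
            if return_chunks == "original":
                result = text[begin:end + 1]   # text as written in the input
            else:
                result = " ".join(target)      # normalized form that matched
            return result, begin, end
    return None

text = "She reports difficulty sleeping and ongoing anxiety."
print(match(text, "difficulty sleep", "original"))
print(match(text, "difficulty sleep", "matched"))
```

Both calls report the same `begin` and `end`; only the returned string changes.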

In [35]:
test_phrases = """
stressor
evaluation psychiatric state
difficulty sleep
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [36]:
text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True) \
    .setBuildFromTokens(True)\
    .setReturnChunks("matched")

text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

In [37]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields(
        {"matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as original"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+----------------------------+----------------------------+
|entity|begin|end|result                      |original                    |
+------+-----+---+----------------------------+----------------------------+
|entity|69   |99 |evaluation psychiatric state|evaluation psychiatric state|
|entity|52   |60 |stressor                    |stressors                   |
|entity|114  |132|difficulty sleep            |difficulty sleeping         |
+------+-----+---+----------------------------+----------------------------+
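
To see which chunks were changed by normalization, the `result` and `original` columns in the table above can be compared directly. The following is a minimal plain-Python sketch that copies the displayed rows by hand; the `rows` and `normalized` names are illustrative and not part of Spark NLP.

```python
# Rows copied by hand from the flattened output above:
# (begin, end, result, original), where `original` is the value of
# metadata.original_or_matched.
rows = [
    (69, 99, "evaluation psychiatric state", "evaluation psychiatric state"),
    (52, 60, "stressor", "stressors"),
    (114, 132, "difficulty sleep", "difficulty sleeping"),
]

# Keep only the rows where the returned result differs from the
# original_or_matched metadata value.
normalized = [
    (original, result)
    for _, _, result, original in rows
    if result != original
]

for original, result in normalized:
    print(f"{original!r} was matched via {result!r}")
```

Here two of the three chunks differ between the columns, which suggests that stemming or lemmatization played a role in those matches.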



### 🚫 `excludeRegexPatterns`

The **`excludeRegexPatterns`** parameter allows you to filter out matched chunks based on **regular expression rules**.

You can provide a list of regex patterns:

- If a matched chunk **matches any of these patterns**, it will be **excluded** from the output.
- If the list is empty (default), **no matches will be filtered**.

This is especially useful when you want to remove:

- noisy or overly generic matches  
- specific codes or token formats  
- matches that follow certain undesirable patterns
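
As a rough illustration of what such a pattern excludes, the sketch below applies the pattern used in the example that follows to a few hand-picked strings with Python's `re` module. This is only an approximation: Spark NLP runs on the JVM, so the annotator evaluates patterns with Java regex semantics, and the candidate strings here are hypothetical rather than actual annotator output.

```python
import re

# Pattern from the example in this section: an uppercase letter followed by
# at least two more uppercase letters, whitespace, hyphens, or digits.
# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r"^[A-Z][A-Z\s\-0-9]{2,}$")

# Hypothetical candidate chunks, chosen for illustration only.
candidates = ["APNEA", "HYPERBILIRUBINEMIA", "apnea of prematurity", "sepsis", "A1"]

# A chunk is dropped when it matches the pattern, kept otherwise.
dropped = [c for c in candidates if pattern.search(c)]
kept = [c for c in candidates if not pattern.search(c)]

print("dropped:", dropped)
print("kept:   ", kept)
```

In this sketch the all-caps section headers are dropped, while lowercase phrases and the two-character string `A1` (too short for the `{2,}` quantifier) are kept.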

In [38]:
text = """APNEA:
Presumed apnea of prematurity since < 34 wks gestation at birth.
HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia d/t prematurity.
1/25-1/30: Received Amp/Gent while undergoing sepsis evaluation."""

empty_data = spark.createDataFrame([[""]]).toDF("text")
data = spark.createDataFrame([[text]]).toDF("text")

In [39]:
text_matcher = TextMatcherInternalModel.pretrained("hpo_matcher","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setExcludeRegexPatterns([r"^[A-Z][A-Z\s\-0-9]{2,}$"])

text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

hpo_matcher download started this may take some time.
Approximate size to download 2 MB
[OK!]


In [40]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+--------------------+
|entity|begin|end|result              |
+------+-----+---+--------------------+
|HPO   |16   |20 |apnea               |
|HPO   |16   |35 |apnea of prematurity|
|HPO   |104  |121|hyperbilirubinemia  |
|HPO   |186  |191|sepsis              |
+------+-----+---+--------------------+



# Usage of Pretrained Model

In [41]:
text = """APNEA: Presumed apnea of prematurity since < 34 wks gestation at birth.
HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia d/t prematurity.
1/25-1/30: Received Amp/Gent while undergoing sepsis evaluation."""

empty_data = spark.createDataFrame([[""]]).toDF("text")
data = spark.createDataFrame([[text]]).toDF("text")

In [42]:
text_matcher = TextMatcherInternalModel.pretrained("hpo_matcher","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")


text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        text_matcher
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

hpo_matcher download started this may take some time.
Approximate size to download 2 MB
[OK!]


In [43]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+--------------------+
|entity|begin|end|result              |
+------+-----+---+--------------------+
|HPO   |16   |20 |apnea               |
|HPO   |16   |35 |apnea of prematurity|
|HPO   |104  |121|hyperbilirubinemia  |
|HPO   |186  |191|sepsis              |
+------+-----+---+--------------------+



# Using TextMatcherInternal with MetadataAnnotationConverter

In this pipeline, `TextMatcherInternal` matches against a stopword-reduced version of each phrase, which makes matching more flexible (e.g., the phrase `denies pain` can match `she denies having any pain`).

However, once a match is found, the original (non-reduced) context is passed to the assertion model.

👉 Therefore, assertion labeling is not affected by stopword removal during matching.
Assertion cues such as `no`, `not`, `denies`, and temporal markers are still present in the input the model sees.

✅ Matching is flexible.  
✅ Assertion is accurate.  
🛡️ Each step uses what it needs.
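
The idea of stopword-reduced matching can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `TextMatcherInternal` is implemented: the `STOPWORDS` set, `tokenize`, and `match_reduced` are hypothetical helpers, and the stopword list is deliberately tiny.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"she", "having", "any", "the", "a", "of"}

def tokenize(text):
    """Return (token, begin, end) triples with inclusive end offsets."""
    return [(m.group(), m.start(), m.end() - 1) for m in re.finditer(r"\w+", text)]

def match_reduced(text, phrase):
    """Find `phrase` in `text` after dropping stopwords from both."""
    kept = [t for t in tokenize(text) if t[0].lower() not in STOPWORDS]
    target = [w.lower() for w in phrase.split() if w.lower() not in STOPWORDS]
    if not target:
        return []
    hits = []
    n = len(target)
    for i in range(len(kept) - n + 1):
        window = kept[i:i + n]
        if [t[0].lower() for t in window] == target:
            # Offsets come from the original text, so the span still
            # covers the stopwords that were skipped during matching.
            begin, end = window[0][1], window[-1][2]
            hits.append((begin, end, text[begin:end + 1]))
    return hits

hits = match_reduced("she denies having any pain", "denies pain")
print(hits)
```

Because the offsets are taken from the original tokens, the recovered span is `denies having any pain`, so a downstream model that reads the original sentence still sees every word around the match.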

In [44]:
text = """
The patient reports no pain in the chest and denies any history of hypertension.
No signs of infection or fever were noted. No signs of nausea or vomiting were observed.

"""

empty_data = spark.createDataFrame([[""]]).toDF("text")
data = spark.createDataFrame([[text]]).toDF("text")

In [45]:
test_phrases = """
pain
hypertension
medication
fever
vomit
"""

with open("test-phrases.txt", "w") as file:
    file.write(test_phrases)

In [46]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

text_matcher = TextMatcherInternal()\
    .setInputCols(["sentence","token"])\
    .setOutputCol("matched_text")\
    .setEntities("./test-phrases.txt")\
    .setEnableLemmatizer(True) \
    .setEnableStemmer(True) \
    .setCleanStopWords(True) \
    .setBuildFromTokens(False)\
    .setReturnChunks("original")

metadata_annotation_converter = MetadataAnnotationConverter()\
    .setInputCols("matched_text")\
    .setInputType("chunk") \
    .setBeginField("begin") \
    .setEndField("end") \
    .setResultField("original_or_matched") \
    .setOutputCol("new_chunk")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_assertion_dl = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "new_chunk", "embeddings"]) \
    .setOutputCol("assertion_dl")


text_matcher_pipeline= Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        text_matcher,
        metadata_annotation_converter,
        clinical_assertion_dl
])

text_matcher_model = text_matcher_pipeline.fit(empty_data)
text_matcher_result_df = text_matcher_model.transform(data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]


In [47]:
flattener_text_matcher = Flattener()\
    .setInputCols("matched_text") \
    .setExplodeSelectedFields({
        "matched_text": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.original_or_matched as matched"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+------------+------------+
|entity|begin|end|result      |matched     |
+------+-----+---+------------+------------+
|entity|24   |27 |pain        |pain        |
|entity|68   |79 |hypertension|hypertension|
|entity|107  |111|fever       |fever       |
|entity|147  |154|vomiting    |vomit       |
+------+-----+---+------------+------------+



In [48]:
flattener_text_matcher = Flattener()\
    .setInputCols("new_chunk") \
    .setExplodeSelectedFields({
        "new_chunk": [
            "metadata.entity as entity",
            "begin as begin",
            "end as end",
            "result as result",
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------+-----+---+------------+
|entity|begin|end|result      |
+------+-----+---+------------+
|entity|24   |27 |pain        |
|entity|68   |79 |hypertension|
|entity|107  |111|fever       |
|entity|147  |154|vomit       |
+------+-----+---+------------+
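
Comparing this table with the previous one, the `new_chunk` column carries the `original_or_matched` value as its `result` (for example `vomit`), while `begin` and `end` are unchanged. A quick plain-Python check, with the text and offsets copied by hand from the cells above, shows that these offsets are inclusive character positions in the original text:

```python
# Same text as in the input cell above, including the leading newline.
text = """
The patient reports no pain in the chest and denies any history of hypertension.
No signs of infection or fever were noted. No signs of nausea or vomiting were observed.

"""

# (begin, end, result) copied from the new_chunk table above.
chunks = [
    (24, 27, "pain"),
    (68, 79, "hypertension"),
    (107, 111, "fever"),
    (147, 154, "vomit"),
]

# `end` is inclusive, so slice up to end + 1 to recover the surface form.
surface = [text[begin:end + 1] for begin, end, _ in chunks]
print(surface)  # ['pain', 'hypertension', 'fever', 'vomiting']
```

The last chunk has `result` set to `vomit`, yet its offsets still cover `vomiting` in the text.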



In [49]:
flattener_text_matcher = Flattener()\
    .setInputCols("assertion_dl") \
    .setExplodeSelectedFields({
        "assertion_dl": [
            "metadata.ner_chunk as chunk",
            "begin as begin",
            "end as end",
            "result as result",
            "metadata.confidence as confidence"
            ]
        }
    )

flattener_text_matcher.transform(text_matcher_result_df).show(n=30,truncate=False)

+------------+-----+---+------+----------+
|chunk       |begin|end|result|confidence|
+------------+-----+---+------+----------+
|pain        |24   |27 |absent|0.9921    |
|hypertension|68   |79 |absent|0.9998    |
|fever       |107  |111|absent|0.9999    |
|vomit       |147  |154|absent|1.0       |
+------------+-----+---+------+----------+

