![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/11.01.TextMatcher_BigTextMatcher.ipynb)

# **TextMatcher / BigTextMatcher**

The objective of this notebook is to explore the different parameters and usage of the TextMatcher and BigTextMatcher annotators in Spark NLP.

**📖 Learning Objectives:**

1. Learn how to use TextMatcher and BigTextMatcher annotators in Spark NLP for text matching tasks, including loading pre-trained models and configuring the matching pipeline.

2. Understand the parameters and options available for the TextMatcher and BigTextMatcher annotators to customize the matching process based on specific use cases.

**🔗 Helpful Links:**

- Documentation : [TextMatcher](https://sparknlp.org/docs/en/annotators#textmatcher), [BigTextMatcher](https://sparknlp.org/docs/en/annotators#bigtextmatcher)

- Python Docs : [TextMatcher](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/matcher/text_matcher/index.html), [BigTextMatcher](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/matcher/big_text_matcher/index.html)

- Scala Docs : [TextMatcher](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/TextMatcher.html), [BigTextMatcher](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/btm/BigTextMatcher.html)

- For extended examples of usage, see the [TextMatcher](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/TextMatcher.scala), [BigTextMatcher](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/btm/BigTextMatcher.scala)

## **📜 Background**

`TextMatcher` and `BigTextMatcher` are Spark NLP annotators that find exact phrases in a document. Both read the phrases to match from an external text file and emit each match as a `CHUNK` annotation; `BigTextMatcher` is intended for larger datasets. Both can be customized, for example in how they treat letter case and overlapping matches, so the matching process can be adapted to a specific use case. Typical applications include information retrieval, sentiment analysis, and content categorization, where a known list of terms has to be located quickly and reliably in text.

## **🎬 Colab Setup**

In [None]:
!pip install spark-nlp
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spark-nlp
  Downloading spark_nlp-4.4.0-py2.py3-none-any.whl (486 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-4.4.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317145 sha256=272c87d48ad65679895ab9697e1831396dcfdf23ea0bf45dbab4137dbdbd7810


## ⚒️ Setup and Import Libraries

In [None]:
import sparknlp
from sparknlp.base import LightPipeline, Pipeline, ReadAs, Finisher
from sparknlp.annotator import SentenceDetector, Tokenizer, DocumentAssembler, TextMatcher, BigTextMatcher
from pyspark.sql import functions as F
import pandas as pd

# Start Spark Session
spark = sparknlp.start()

##  📑 **`TextMatcher`**

`TextMatcher` is a Spark NLP annotator that matches exact phrases in a document against the phrases listed in a provided file. It requires `DOCUMENT` and `TOKEN` as input and produces `CHUNK` as output.

### **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `CHUNK`

### **🔎Parameters**

- `setEntities`: Sets the external resource for the entities.
- `setEntityValue`: Sets value for the entity metadata field.
- `setCaseSensitive`: Sets whether matching is case sensitive, by default True.
- `setMergeOverlapping`: Sets whether to merge overlapping matched chunks, by default False.
- `setBuildFromTokens`: Sets whether the `TextMatcher` should take the `CHUNK` from `TOKEN` or not.

#### `setEntities`

- `setEntities` tells the `TextMatcher` where to find the phrases to match. It takes the path to an external text file that lists the phrases, one per line, plus optional arguments that control how the file is read.

- `setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})`

  Parameters:

  **path**: str

  **read_as**: str, optional, by default ReadAs.TEXT

  **options**: dict, optional, by default {"format": "text"}

Here is an example usage of `setEntities`. First, let’s create a dataframe of a sample text:

In [None]:
# Create a dataframe from the sample_text
data = spark.createDataFrame([
["""As she traveled across the world, Emma visited many different places
and met many fascinating people. She walked the busy streets of Tokyo,
hiked the rugged mountains of Nepal, and swam in the crystal-clear waters
of the Caribbean. Along the way, she befriended locals like Akira, Rajesh,
and Maria, each with their own unique stories to tell. Emma's travels took her
to many cities, including New York, Paris, and Hong Kong, where she savored
delicious foods and explored vibrant cultures. No matter where she went,
Emma always found new wonders to discover and memories to cherish."""]
]).toDF("text")

Let’s define the names and locations that we seek to match and save them as text files:

In [None]:
# PERSON
person_matches = """
Emma
Akira
Rajesh
Maria
"""

with open('person_matches.txt', 'w') as f:
    f.write(person_matches)

# LOCATION
location_matches = """
Tokyo
Nepal
Caribbean
New York
Paris
Hong Kong
"""

with open('location_matches.txt', 'w') as f:
    f.write(location_matches)
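
The triple-quoted strings above begin with a newline, so the saved files start with a blank line. If you prefer phrase files without blank lines, a minimal sketch such as the following can be used. `write_phrase_file` and `read_phrase_file` are hypothetical helpers written for this notebook, not part of Spark NLP, and the demo file name is an assumption.

```python
import os
import tempfile


def write_phrase_file(path, phrases):
    """Write one phrase per line, skipping empty entries (hypothetical helper)."""
    cleaned = [p.strip() for p in phrases if p.strip()]
    with open(path, "w") as f:
        f.write("\n".join(cleaned) + "\n")
    return cleaned


def read_phrase_file(path):
    """Read the phrases back, ignoring blank lines (hypothetical helper)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Write to a temporary directory so the demo does not overwrite the files above
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "location_matches_demo.txt")
write_phrase_file(demo_path, ["Tokyo", "Nepal", "New York", ""])

loaded = read_phrase_file(demo_path)
print(loaded)
```

Multi-word phrases such as "New York" stay on a single line, which is the same layout the files above use.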

Create the pipeline, and define `setEntities` in `TextMatcher()` to match the input text with PERSON and LOCATION entities above:

In [None]:
# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Step 2: Gets the tokens of the text
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# Step 3: PERSON matcher
person_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("person_matches.txt", ReadAs.TEXT) \
    .setOutputCol("person_entity")

# Step 4: LOCATION matcher
location_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("location_matches.txt", ReadAs.TEXT) \
    .setOutputCol("location_entity")


pipeline = Pipeline().setStages([document_assembler,
                                 tokenizer,
                                 person_extractor,
                                 location_extractor
                                 ])

Fit and transform:

In [None]:
# Fit and transform to get a prediction
results = pipeline.fit(data).transform(data)

# Display the results
results.selectExpr("person_entity.result", "location_entity.result").show(truncate=False)

+----------------------------------+-----------------------------------------------------+
|result                            |result                                               |
+----------------------------------+-----------------------------------------------------+
|[Emma, Akira, Rajesh, Maria, Emma]|[Tokyo, Nepal, Caribbean, New York, Paris, Hong Kong]|
+----------------------------------+-----------------------------------------------------+



#### `setEntityValue`

- In Spark NLP's `TextMatcher`, the `setEntityValue` function allows you to set a custom value for the "entity" metadata field of the matched phrases. This can be particularly useful when you want to assign a specific label or category to the matched phrases in the output.

- The "entity" metadata field is part of the output annotations that `TextMatcher` produces. By default, its value is the literal string `entity` for every match. You may want to assign a more meaningful label or category to the matched phrases so that they are easier to interpret or process in later stages of your NLP pipeline.

- `setEntityValue(b)`

    **b**: str
    
    Value for the entity metadata field, by default "entity"
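
To illustrate why a label is useful downstream, here is a minimal pure-Python sketch. It models each match as a small dictionary with a `result` and a `metadata["entity"]` value; this is a simplification for illustration, since real Spark NLP annotations carry more fields, and `label_chunks` is a hypothetical helper, not a Spark NLP function.

```python
def label_chunks(chunks, entity_value="entity"):
    """Attach the same entity label to every matched chunk (hypothetical helper)."""
    return [{"result": c, "metadata": {"entity": entity_value}} for c in chunks]


# Without a custom value, every match carries the same generic label
default_labels = label_chunks(["Emma", "Tokyo"])

# With custom values, matches from different matchers can be told apart
person = label_chunks(["Emma", "Akira"], entity_value="PERSON")
location = label_chunks(["Tokyo"], entity_value="LOCATION")

# Group the matched phrases by their label
by_label = {}
for annotation in person + location:
    by_label.setdefault(annotation["metadata"]["entity"], []).append(annotation["result"])

print(default_labels)
print(by_label)
```

With the generic label, phrases from different matchers cannot be separated once they are combined; with distinct labels they can be grouped directly.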

To see the default value in practice, let's look at the metadata of the previous example's results.

In [None]:
# Fit and transform to get a prediction
results = pipeline.fit(data).transform(data)

# Display the results
results.selectExpr("person_entity.metadata", "location_entity.metadata").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                   |metadata                                                                                                                                                                                                                                          

The output shows that the "entity" metadata field is set to `entity` by default. Let's change it for both matchers by updating the pipeline.

In [None]:
# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Step 2: Gets the tokens of the text
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# Step 3: PERSON matcher
person_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("person_matches.txt", ReadAs.TEXT) \
    .setEntityValue("PERSON") \
    .setOutputCol("person_entity")

# Step 4: LOCATION matcher
location_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("location_matches.txt", ReadAs.TEXT) \
    .setEntityValue("LOCATION") \
    .setOutputCol("location_entity")


pipeline = Pipeline().setStages([document_assembler,
                                 tokenizer,
                                 person_extractor,
                                 location_extractor
                                 ])

Phrases matched from the person_matches.txt file and the location_matches.txt file are now labeled with `setEntityValue("PERSON")` and `setEntityValue("LOCATION")`, respectively, so the metadata results change accordingly.

In [None]:
# Fit and transform to get a prediction
results = pipeline.fit(data).transform(data)

# Display the results
results.selectExpr("person_entity.metadata", "location_entity.metadata").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                   |metadata                                                                                                                                                                                                                              

#### `setCaseSensitive`

- The `setCaseSensitive` option in Spark NLP's `TextMatcher` controls whether letter case is taken into account when phrases are matched against the input text. It accepts a boolean value:

- True: The `TextMatcher` takes case into account. The phrases in the entity file must match the case of the input text exactly. For instance, if you are looking for the term "apple", the `TextMatcher` won't match "Apple" or "APPLE" in the input text.

- False: Case is ignored. The `TextMatcher` matches a phrase regardless of how it is capitalized in the input text. In the same example, it will match "apple," "Apple," and "APPLE" in the input text if you are looking for the keyword "apple."

- `setCaseSensitive(b)`

    **b**: bool

    Whether matching is case sensitive, by default True
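
The "apple" example can be reproduced with a minimal pure-Python sketch. `match_phrases` is a hypothetical helper that compares whole tokens; it illustrates the idea of case-sensitive versus case-insensitive matching and is not how Spark NLP implements it.

```python
import re


def match_phrases(text, phrases, case_sensitive=True):
    """Return the surface forms of `phrases` found in `text`, in text order."""
    norm = (lambda s: s) if case_sensitive else str.lower
    tokens = re.findall(r"\w+", text)
    words = [norm(t) for t in tokens]
    found = []
    for i in range(len(words)):
        for phrase in phrases:
            target = [norm(w) for w in phrase.split()]
            # Compare a window of tokens with the (possibly multi-word) phrase
            if target and words[i:i + len(target)] == target:
                found.append(" ".join(tokens[i:i + len(target)]))
    return found


text = "An apple a day: Apple pie, APPLE juice."
strict = match_phrases(text, ["apple"], case_sensitive=True)
relaxed = match_phrases(text, ["apple"], case_sensitive=False)

print(strict)
print(relaxed)
```

Note that the sketch returns the text as it appears in the input, not as it is written in the phrase list.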

Let's see this by changing the names and locations files above.

In [None]:
# PERSON
person_matches = """
emma
Akira
rajesh
MARIA
"""

with open('person_matches.txt', 'w') as f:
    f.write(person_matches)

# LOCATION
location_matches = """
Tokyo
nepal
CARIBBEAN
New York
Paris
hong Kong
"""

with open('location_matches.txt', 'w') as f:
    f.write(location_matches)

In [None]:
# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Step 2: Gets the tokens of the text
tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# Step 3: PERSON matcher
person_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("person_matches.txt", ReadAs.TEXT) \
    .setEntityValue("PERSON") \
    .setOutputCol("person_entity")

# Step 4: LOCATION matcher
location_extractor = TextMatcher() \
    .setInputCols("document", "token") \
    .setEntities("location_matches.txt", ReadAs.TEXT) \
    .setEntityValue("LOCATION") \
    .setOutputCol("location_entity")\
    .setCaseSensitive(False)


pipeline = Pipeline().setStages([document_assembler,
                                 tokenizer,
                                 person_extractor,
                                 location_extractor
                                 ])

In [None]:
# Fit and transform to get a prediction
results = pipeline.fit(data).transform(data)

# Display the results
results.selectExpr("person_entity.result", "location_entity.result").show(truncate=False)

+-------+-----------------------------------------------------+
|result |result                                               |
+-------+-----------------------------------------------------+
|[Akira]|[Tokyo, Nepal, Caribbean, New York, Paris, Hong Kong]|
+-------+-----------------------------------------------------+



The person matcher only finds Akira, because its `setCaseSensitive` is left at the default `True` and "Akira" is the only entry in person_matches.txt whose case matches the input text. The location matcher, on the other hand, has `setCaseSensitive` set to `False`, so it matches every entity defined in the location_matches.txt file regardless of case.

#### `setMergeOverlapping`

- In Spark NLP, the `setMergeOverlapping` parameter of the `TextMatcher` determines whether overlapping matched chunks should be merged. By default, this value is set to `False`, meaning overlapping matches are kept as separate chunks.

- If `setMergeOverlapping` is `True`, the `TextMatcher` will merge overlapping matches into a single chunk. This is particularly useful when you have phrases with shared words or characters and want to treat them as a single match.

- `setMergeOverlapping(b)`

    **b**: bool

    Whether to merge overlapping matched chunks


Here is an example to show how to use `setMergeOverlapping`, and its effect on the results.

In [None]:
# Create a dataframe from the sample_text
data = spark.createDataFrame([
    ("""The new AI technology is making great strides in areas like machine learning, natural language processing, and computer vision.""",)
]).toDF("text")

Define the names that we seek to match and save them as text files:

In [None]:
entities_matches = """
AI
AI technology
machine learning
natural language processing
language processing
computer vision
"""

with open('entities.txt', 'w') as f:
    f.write(entities_matches)

Create the pipeline. `setMergeOverlapping` is left at its default value of `False`, and the `Finisher` cleans the annotations and excludes the metadata.

In [None]:
# Step 1: Transforms raw texts to `document` annotation
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Step 2: Detects sentences within the document
sentenceDetector = SentenceDetector()\
  .setInputCols("document")\
  .setOutputCol("sentence")

# Step 3: Tokenizes the words within the document
tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")

# Step 4: Matches the tokens with the entities defined in the `entities.txt` file
extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols("token", "sentence")\
  .setCaseSensitive(False)\
  .setOutputCol("entities")

# Step 5: Extracts only the matched entities from the `entities` column
finisher = Finisher() \
    .setInputCols("entities") \
    .setOutputCols("matched_entities")\
    .setIncludeMetadata(False) \
    .setCleanAnnotations(True)

# Create a pipeline containing all the stages
pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])

# Fit and transform the DataFrame
result_setMergeOverlapping_False = pipeline.fit(data).transform(data)

# Show the results
result_setMergeOverlapping_False.select("matched_entities").show(truncate=False)

+--------------------------------------------------------------------------------------------------------+
|matched_entities                                                                                        |
+--------------------------------------------------------------------------------------------------------+
|[AI, AI technology, machine learning, natural language processing, language processing, computer vision]|
+--------------------------------------------------------------------------------------------------------+



Create the same pipeline, this time with `setMergeOverlapping` set to `True`.

In [None]:
# Step 1: Transforms raw texts to `document` annotation
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Step 2: Detects sentences within the document
sentenceDetector = SentenceDetector()\
  .setInputCols("document")\
  .setOutputCol("sentence")

# Step 3: Tokenizes the words within the document
tokenizer = Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

# Step 4: Matches the tokens with the entities defined in the `entities.txt` file
extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols("token", "sentence")\
  .setOutputCol("entities")\
  .setCaseSensitive(False)\
  .setMergeOverlapping(True)

# Step 5: Extracts only the matched entities from the `entities` column
finisher = Finisher() \
    .setInputCols("entities") \
    .setOutputCols("matched_entities")\
    .setIncludeMetadata(False) \
    .setCleanAnnotations(True)

# Create a pipeline containing all the stages
pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])

# Fit and transform the DataFrame
result_setMergeOverlapping_True = pipeline.fit(data).transform(data)

# Show the results
result_setMergeOverlapping_True.select("matched_entities").show(truncate=False)

+-------------------------------------------------------------------------------+
|matched_entities                                                               |
+-------------------------------------------------------------------------------+
|[AI technology, machine learning, natural language processing, computer vision]|
+-------------------------------------------------------------------------------+



In [None]:
# Convert Spark DataFrames to pandas DataFrames
result_setMergeOverlapping_False_pd = result_setMergeOverlapping_False.select("matched_entities").toPandas()
result_setMergeOverlapping_True_pd = result_setMergeOverlapping_True.select("matched_entities").toPandas()

# Rename columns to distinguish between the two sets of results
result_setMergeOverlapping_False_pd = result_setMergeOverlapping_False_pd.rename(columns={"matched_entities": "matched_entities_no_merge"})
result_setMergeOverlapping_True_pd = result_setMergeOverlapping_True_pd.rename(columns={"matched_entities": "matched_entities_with_merge"})

# Concatenate the two pandas DataFrames, set max_colwidth for pandas
combined_results = pd.concat([result_setMergeOverlapping_False_pd, result_setMergeOverlapping_True_pd], axis=1)
pd.set_option('max_colwidth', None)

# Display the combined results
combined_results

| | matched_entities_no_merge | matched_entities_with_merge |
|---|---|---|
| 0 | [AI, AI technology, machine learning, natural language processing, language processing, computer vision] | [AI technology, machine learning, natural language processing, computer vision] |
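
The difference between the two result columns above can be reproduced with a minimal pure-Python sketch. `match_spans` and `merge_overlapping` are hypothetical helpers, and the sketch assumes that overlapping character spans are combined into their union; that is one plausible strategy which agrees with the output shown here, not necessarily how `TextMatcher` implements merging.

```python
import re


def match_spans(text, phrases):
    """Case-insensitive, token-aligned (begin, end) spans; end is exclusive."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    words = [t[0] for t in tokens]
    spans = []
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        if n == 0:
            continue
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                spans.append((tokens[i][1], tokens[i + n - 1][2]))
    return sorted(spans)


def merge_overlapping(spans):
    """Combine spans that share characters into a single span (their union)."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


text = ("The new AI technology is making great strides in areas like "
        "machine learning, natural language processing, and computer vision.")
phrases = ["AI", "AI technology", "machine learning",
           "natural language processing", "language processing", "computer vision"]

spans = match_spans(text, phrases)
separate = [text[b:e] for b, e in spans]
merged = [text[b:e] for b, e in merge_overlapping(spans)]

print(separate)
print(merged)
```

In this sketch, "AI" is absorbed by "AI technology" and "language processing" by "natural language processing", because each shorter span lies inside the longer one.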


#### `setBuildFromTokens`

- The `setBuildFromTokens` parameter in `TextMatcher` is used to determine whether the `TextMatcher` should build chunks from tokens or not. By default it is set to `False`, meaning the `TextMatcher` will not build chunks from tokens.

- If `setBuildFromTokens` is set to `True`, the `TextMatcher` will consider individual tokens as potential matches for the provided entities. This can be useful when you want to match your entity list with the tokens in the text, rather than searching for the exact phrase.

- `setBuildFromTokens(b)`

   **b**: bool

   Whether the `TextMatcher` should take the `CHUNK` from `TOKEN` or not

##  📑 **`BigTextMatcher`**

`BigTextMatcher` is an extension of Spark NLP's `TextMatcher`, designed for matching exact phrases against large documents or corpora. It builds a Trie from the phrases in the input file, which keeps lookups fast as the phrase list and the dataset grow.

A text file of predefined phrases must be provided with `setStoragePath`.

### **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `CHUNK`

### **🔎Parameters**

- `setEntities`: Sets the external resource for the entities.
- `setCaseSensitive`: Sets whether matching is case sensitive, by default True.
- `setMergeOverlapping`: Sets whether to merge overlapping matched chunks, by default False.
- `setTokenizer`: Sets TokenizerModel to use to tokenize input file for building a Trie.

The parameters `setEntities`, `setCaseSensitive`, and `setMergeOverlapping` in `BigTextMatcher` are used in the same way as they are used in `TextMatcher`.
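
To give an intuition for the Trie mentioned under `setTokenizer`, here is a minimal pure-Python sketch of a token-level trie with case-insensitive lookup. `build_trie` and `find_matches` are hypothetical helpers; this is an in-memory illustration of the data structure only and does not reflect how `BigTextMatcher` stores or queries its phrases.

```python
END = None  # sentinel key marking the end of a stored phrase


def build_trie(phrases):
    """Store each phrase as a path of lowercased tokens in nested dicts."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def find_matches(trie, tokens):
    """Walk the trie from every start position and collect stored phrases."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for start in range(len(lowered)):
        node = trie
        for end in range(start, len(lowered)):
            node = node.get(lowered[end])
            if node is None:
                break  # no stored phrase continues with this token
            if END in node:
                matches.append(" ".join(tokens[start:end + 1]))
    return matches


trie = build_trie(["Wall Street", "USD", "stock", "NYSE"])
tokens = "Stock prices fell on Wall Street as the NYSE closed".split()
matches = find_matches(trie, tokens)
print(matches)
```

Each lookup stops as soon as no stored phrase continues with the next token, so the work per start position depends on the length of the longest stored phrase and not on the number of phrases.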

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv

news_df = spark.read \
            .option("header", True) \
            .csv("news_category_train.csv")

In [None]:
news_df.show(5, truncate=50)

+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows



In [None]:
# Write the target entities to text files, one phrase per line

entities = ['Wall Street', 'USD', 'stock', 'NYSE']
with open('financial_entities.txt', 'w') as f:
    for i in entities:
        f.write(i + '\n')


entities = ['soccer', 'world cup', 'Messi', 'FC Barcelona']
with open('sport_entities.txt', 'w') as f:
    for i in entities:
        f.write(i + '\n')


In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

financial_entity_extractor = BigTextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("financial_entities")\
    .setStoragePath("financial_entities.txt", ReadAs.TEXT)\
    .setCaseSensitive(False)

sport_entity_extractor = BigTextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("sport_entities")\
    .setStoragePath("sport_entities.txt", ReadAs.TEXT)\
    .setCaseSensitive(False)

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        financial_entity_extractor,
        sport_entity_extractor
        ])

result = nlpPipeline.fit(news_df).transform(news_df)

In [None]:
result.select('description','financial_entities.result','sport_entities.result')\
      .toDF('text','financial_matches','sport_matches').filter((F.size('financial_matches')>1) | (F.size('sport_matches')>1))\
      .show(truncate=70)

+----------------------------------------------------------------------+----------------------------------+-------------------+
|                                                                  text|                 financial_matches|      sport_matches|
+----------------------------------------------------------------------+----------------------------------+-------------------+
|"Company launched the biggest electronic auction of stock in Wall S...|              [stock, Wall Street]|                 []|
|Google, Inc. significantly cut the expected share price for its ini...|                    [stock, stock]|                 []|
|Google, Inc. significantly cut the expected share price this mornin...|                    [stock, stock]|                 []|
| Shares of Air Canada  (AC.TO) fell by more than half on Wednesday,...|                    [Stock, stock]|                 []|
|Stock prices are lower in moderate trading. The Dow Jones Industria...|                    [Stock, Stoc