![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/RegexMatcherInternal.ipynb)

#   **📜 RegexMatcher**

This notebook will cover the usage of RegexMatcher.

**📖 Learning Objectives:**

1. Understand how to match exact phrases by using regex patterns.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop Repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/40.Rule_Based_Entity_Matchers.ipynb)

- Python Docs : [RegexMatcherInternal](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/regex/regex_matcher/index.html)

- Scala Docs : [RegexMatcherInternal](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/regex/RegexMatcherInternal.html)

Reference Documentation: [RegexMatcherInternal](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#regexmatcherinternal)

## **📜 Background**

The **`RegexMatcherInternal`** annotator matches a set of regular expressions against the text and labels each match with the entity assigned to its pattern. It is useful for linking well-defined textual patterns, such as dates, to predetermined entity labels.

The class allows users to define rules using regular expressions paired with entities, offering flexibility in customization. These rules can either be directly set using the `setRules` method, with a specified delimiter, or loaded from an external file using the `setExternalRules` method.

Additionally, users can set the matching strategy (`MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE`) to control how matches are handled. The annotator takes `DOCUMENT` annotations as input and produces `CHUNK` annotations as output, which makes it a versatile tool for rule-based entity recognition on user-defined patterns.


A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be `"\\d{4}\\/\\d\\d\\/\\d\\d,date"` which will match strings like `"1970/01/01"` to the identifier `"date"`.
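
As a rough illustration of the rule format, the sketch below splits a rule string at its delimiter and applies the pattern with Python's built-in `re` module. The helper name `parse_rule` is hypothetical, and Python's regex engine is only a stand-in for the matcher's own engine, so treat this as a sanity check rather than a description of the annotator's internals.

```python
import re

def parse_rule(rule, delimiter):
    """Split a rule at its last delimiter into (pattern, identifier)."""
    pattern, _, identifier = rule.rpartition(delimiter)
    return pattern, identifier

# The example rule from above; a raw string keeps the backslashes literal.
pattern, identifier = parse_rule(r"\d{4}\/\d\d\/\d\d,date", ",")

# Pair every match with the identifier of the rule that produced it.
matches = [(m.group(), identifier) for m in re.finditer(pattern, "Born on 1970/01/01.")]
print(matches)
```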

## **🎬 Colab Setup**

In [1]:
! pip install -q johnsnowlabs

In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pandas as pd
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `CHUNK`

##  **🔎 Parameters**

**Parameters**:

- `strategy`: Matching strategy, one of `MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE`. The default is `MATCH_ALL`.
- `rules`: Regex rules, each pairing a pattern with the identifier to assign to its matches.
- `delimiter`: Delimiter that separates the pattern from the identifier in rules provided with `setRules`.
- `externalRules`: Path to an external file containing the rules; requires a `delimiter` in its options.

### IP and DATE

In [5]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

data = spark.createDataFrame([[text]]).toDF("text")

### setExternalRules()

This method loads the rules from an external file. It also requires the `delimiter` that separates each pattern from its entity.

In [6]:
!mkdir -p rules

# Raw string so that backslashes reach the file unchanged
rules = r'''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)
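
One reason to pick a delimiter such as `~` is that it should not occur inside the patterns themselves. A quick check in plain Python (illustrative only, not part of the annotator) shows that a comma would be ambiguous for the IP address rule, because the quantifier `{1,3}` already contains one:

```python
ip_rule = r"(\d{1,3}\.){3}\d{1,3}~IPADDR"

# '~' appears exactly once, so the split is unambiguous.
tilde_parts = ip_rule.split("~")

# The same rule written with ',' as the delimiter would also split inside the quantifiers.
comma_parts = ip_rule.replace("~", ",").split(",")

print(len(tilde_parts), len(comma_parts))
```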

In [7]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = medical.RegexMatcher()\
    .setInputCols('document')\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [8]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)

+--------------------------------------+----------------------------------------------------------------------+
|                                result|                                                              metadata|
+--------------------------------------+----------------------------------------------------------------------+
|[2093-01-13, 203.120.223.13, 01/13/93]|[{entity -> DATE, chunk -> 0, sentence -> 0}, {entity -> IPADDR, ch...|
+--------------------------------------+----------------------------------------------------------------------+



In [9]:
result_df = result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                                 result.regex_matches.begin,
                                                 result.regex_matches.end,
                                                 result.regex_matches.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex_result"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"))
result_df.show()

+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+
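
Judging from the table above, `begin` and `end` look like zero-based character offsets with an inclusive `end`. A small check in plain Python, using only the first line of the sample text, is consistent with that reading (this is an observation from the output, not a statement taken from the API documentation):

```python
first_line = "Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435."

begin, end = 38, 47  # offsets reported for the first DATE chunk

# Python slices exclude the upper bound, so add 1 to an inclusive end offset.
chunk = first_line[begin:end + 1]
print(chunk)
```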



### setStrategy()

This method sets the matching strategy, which can be `MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE`. The default is `MATCH_ALL`.

In [10]:
# Override the default strategy (MATCH_ALL) with MATCH_FIRST
regex_matcher_internal = medical.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_FIRST")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [11]:
result_df = result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                                 result.regex_matches.begin,
                                                 result.regex_matches.end,
                                                 result.regex_matches.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex_result"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"))
result_df.show()

+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
+--------------+-----+---+---------+
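
Compared with the earlier `MATCH_ALL` output, this table lacks only the second DATE match, which suggests that `MATCH_FIRST` keeps the first hit of each rule while `MATCH_ALL` keeps every hit. The sketch below mimics that behavior with Python's `re` module. The function `apply_rules` is hypothetical and is not the annotator's implementation, and `MATCH_COMPLETE` is not covered.

```python
import re

RULES = [
    (r"(\d{1,3}\.){3}\d{1,3}", "IPADDR"),
    (r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}", "DATE"),
]

def apply_rules(text, rules, strategy="MATCH_ALL"):
    """Return (chunk, label) pairs ordered by position in the text."""
    found = []
    for pattern, label in rules:
        hits = list(re.finditer(pattern, text))
        if strategy == "MATCH_FIRST":
            hits = hits[:1]  # keep only the first hit of this rule
        found.extend((m.start(), m.group(), label) for m in hits)
    return [(chunk, label) for _, chunk, label in sorted(found)]

sample = "Record date: 2093-01-13, IP 203.120.223.13, admitted on 01/13/93"
all_hits = apply_rules(sample, RULES, "MATCH_ALL")
first_hits = apply_rules(sample, RULES, "MATCH_FIRST")
print(all_hits)
print(first_hits)
```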



### setDelimiter() and setRules()

`setRules()` sets the regex rules to match the entity with. Each rule must consist of a regex pattern and an entity for that pattern. The pattern and the entity must be separated by a delimiter character, which must also be set with `setDelimiter()`.

In [12]:
# Raw strings keep the regex backslashes literal
regex_matcher_internal = medical.RegexMatcher()\
    .setInputCols('document')\
    .setRules([r"(\d{1,3}\.){3}\d{1,3}~IPADDR", r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE"])\
    .setDelimiter("~")\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [13]:
result_df = result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                                 result.regex_matches.begin,
                                                 result.regex_matches.end,
                                                 result.regex_matches.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex_result"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"))
result_df.show()

+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+



### AGE

- `(?i)(?<=((age of|age)))(\d{1,3})~AGE`: this rule matches a number preceded by the prefix `age of` or `age`
- `(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE`: this rule matches a number followed by the suffix `-years-old`, `years-old`, or `-year-old`
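
To see the suffix rule at work outside Spark, the lookahead can be tried with Python's `re` module. This is only an approximation: the inline `(?i)` flag is replaced by `re.IGNORECASE`, since recent Python versions reject inline global flags that are not at the start of the pattern, and the prefix rule is left out because Python's `re` requires fixed-width lookbehind.

```python
import re

# Suffix rule: digits followed (after at most 3 separator characters) by one of
# the "years-old" variants. The lookahead does not consume the suffix.
age_suffix = re.compile(
    r"(\d{1,3})(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))",
    re.IGNORECASE,
)

sentence = "He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93"
ages = [m.group(1) for m in age_suffix.finditer(sentence)]
print(ages)
```

Only the number is returned as the chunk; the date digits at the end of the sentence are ignored because no age suffix follows them.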


In [14]:
# Raw string so that backslashes reach the file unchanged
regex_rules_age = r"""
(?i)(?<=((age of|age)))(\d{1,3})~AGE
(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE
"""
with open('./rules/age_rules.txt', 'w') as f:
    f.write(regex_rules_age)

In [15]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = medical.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules('./rules/age_rules.txt', delimiter='~')


nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [16]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)


+------+--------------------------------------------+
|result|                                    metadata|
+------+--------------------------------------------+
|  [60]|[{entity -> AGE, chunk -> 0, sentence -> 0}]|
+------+--------------------------------------------+



In [17]:
# .show() returns None, so the call is not assigned back to `result`
result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                     result.regex_matches.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("regex_result"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+------------+---------+
|regex_result|ner_label|
+------------+---------+
|          60|      AGE|
+------------+---------+



## Create Custom Regex Rules

Let's create custom regex rules and use them in a pipeline.

In [18]:
def write_rule(file_path, rules):
    """Write one "<regex>~<label>" line per rule variant for every label in `rules`."""

    # Lookbehind templates: a keyword, optionally followed by exactly i non-alphanumeric characters
    prefix_rule = "(?i)(?<=((({})[^a-z0-9]{})))"
    prefix_rule_init = "(?i)(?<=(({})))"
    # Lookahead templates: exactly i non-alphanumeric characters (or none), then a keyword
    suffix_rule = "(?i)(?=(([^a-z0-9]{}({}))))"
    suffix_rule_init = "(?i)(?=(({})))"

    with open(file_path, 'w') as f:
        for label, spec in rules.items():
            pattern = spec['rule']
            context_length = spec['contextLength']
            # 'prefix' and 'suffix' are optional; a missing key means no rules of that kind
            prefixes = "|".join(spec.get('prefix', []))
            suffixes = "|".join(spec.get('suffix', []))

            if prefixes:
                # Keyword directly before the match, then with a gap of 1..contextLength-1 characters
                f.write(prefix_rule_init.format(prefixes) + pattern + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(prefix_rule.format(prefixes, '{' + str(i) + '}') + pattern + f"~{label}\n")

            if suffixes:
                # The lookahead goes after the pattern so that it checks the text following the match
                f.write(pattern + suffix_rule_init.format(suffixes) + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(pattern + suffix_rule.format('{' + str(i) + '}', suffixes) + f"~{label}\n")
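
To make the effect of `contextLength` concrete, the sketch below reproduces only the prefix templates from `write_rule` and prints the lines they generate for a small, made-up configuration. The helper `expand_prefix_rules` and the two-keyword prefix list are illustrative assumptions.

```python
PREFIX_RULE_INIT = "(?i)(?<=(({})))"
PREFIX_RULE = "(?i)(?<=((({})[^a-z0-9]{})))"

def expand_prefix_rules(label, pattern, prefixes, context_length):
    """Return one rule line per allowed gap (0 .. context_length - 1 characters)."""
    joined = "|".join(prefixes)
    lines = [PREFIX_RULE_INIT.format(joined) + pattern + f"~{label}"]
    for gap in range(1, context_length):
        lines.append(PREFIX_RULE.format(joined, "{" + str(gap) + "}") + pattern + f"~{label}")
    return lines

ssn_lines = expand_prefix_rules("SSN", r"(\d{3}-\d{2}-\d{4})", ["ssn", "ss#"], 3)
for line in ssn_lines:
    print(line)
```

Each generated line accepts a fixed number of non-alphanumeric characters between the keyword and the value, so a `contextLength` of 3 covers gaps of 0, 1, and 2 characters.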

In [19]:
!mkdir -p rules regex_models

### SSN

In [20]:
rule_path = "rules/ssn_regex_rule.txt"
model_path = "regex_models/ssn_regex_parser_model"
context_length = 3

regex_rules_ssn = {
    'SSN' :
        {
            'rule' : r'(\d{3}-\d{2}-\d{4})',
            'prefix' : [
                "social", "security", "ss#", "ssn#","ssid", "ss #", "ssn #", "SSA Number", "social security number",
                "social security #", "social security#", "social security no","Soc Sec", "SSN", "SSNS", "SSN#", "SS#", "SSID"
                ],
            'label' : 'SSN',
            'contextLength' : context_length
        }
}

write_rule(rule_path, regex_rules_ssn)

ssn_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("ssn_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      ssn_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### AGE

In [21]:
rule_path = "rules/age_regex_rule.txt"
model_path = "regex_models/age_regex_parser_model"

with open(rule_path, 'w') as f:
    # One rule per line; the second rule starts at column 0 so no whitespace ends up in the pattern
    f.write(r"""(?i)(?<=((age of|age)))(\d{1,3})~AGE
(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE""")


age_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("age_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      age_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### MAIL

In [22]:
rule_path = "rules/mail_regex_rule.txt"
model_path = "regex_models/mail_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"""[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}~EMAIL""")


mail_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("mail_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      mail_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### PHONE

In [23]:
rule_path = "rules/phone_regex_rule.txt"
model_path = "regex_models/phone_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"""\(\d{3}\) \d{3}-\d{4}~PHONE""")


phone_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("phone_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      phone_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)
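
Before saving rules into models, it can help to try the patterns locally. The sketch below runs the EMAIL, PHONE, and bare SSN patterns from the sections above against a condensed line based on the sample text. It uses Python's `re` module, which is assumed here to behave like the matcher's engine for these simple patterns; the SSN prefix lookbehinds are omitted.

```python
import re

PATTERNS = {
    "EMAIL": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "PHONE": r"\(\d{3}\) \d{3}-\d{4}",
    "SSN": r"(\d{3}-\d{2}-\d{4})",
}

line = "SSN #333-44-6666, Phone (302) 786-5227, E-MAIL: smith@gmail.com."

# Collect every match of every pattern, keyed by entity label.
found = {label: [m.group() for m in re.finditer(p, line)] for label, p in PATTERNS.items()}
print(found)
```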

### Built Pipeline

Here is a pipeline that uses four different regex matchers. All of their matches are merged into a single `ner_chunk` output.
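
Conceptually, the merge step combines the chunks found by each matcher into one list. The sketch below is a simplified stand-in, not the `ChunkMergeApproach` algorithm: it merges hypothetical `(begin, end, text, label)` tuples, orders them by position, and drops any chunk that overlaps one already kept.

```python
def merge_chunks(*chunk_lists):
    """Merge chunk lists by start offset, skipping chunks that overlap a kept one."""
    merged = []
    for chunk in sorted(c for chunks in chunk_lists for c in chunks):
        begin = chunk[0]
        # Offsets are inclusive, so a chunk overlaps if it starts at or before the previous end.
        if merged and begin <= merged[-1][1]:
            continue
        merged.append(chunk)
    return merged

# Illustrative chunks, one list per matcher output column
ssn = [(239, 249, "333-44-6666", "SSN")]
age = [(121, 122, "60", "AGE")]
mail = [(347, 361, "smith@gmail.com", "EMAIL")]
phone = [(289, 302, "(302) 786-5227", "PHONE")]

ner_chunk = merge_chunks(ssn, age, mail, phone)
labels = [label for *_, label in ner_chunk]
print(labels)
```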

In [24]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

regex_matcher_ssn = medical.RegexMatcherModel.load("regex_models/ssn_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("ssn_matched_text")

regex_matcher_age = medical.RegexMatcherModel.load("regex_models/age_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("age_matched_text")

regex_matcher_mail = medical.RegexMatcherModel.load("regex_models/mail_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("mail_matched_text")

regex_matcher_phone = medical.RegexMatcherModel.load("regex_models/phone_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("phone_matched_text")

chunk_merge = medical.ChunkMergeApproach()\
      .setInputCols("ssn_matched_text",
                    "age_matched_text",
                    "mail_matched_text",
                    "phone_matched_text")\
      .setOutputCol("ner_chunk")\
      .setMergeOverlapping(True)\
      .setChunkPrecedence("field")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


In [25]:
nlpPipeline = nlp.Pipeline(stages=[
      document_assembler,
      sentence_detector,
      tokenizer,
      regex_matcher_ssn,
      regex_matcher_age,
      regex_matcher_mail,
      regex_matcher_phone,
      chunk_merge
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_pipeline_model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(regex_pipeline_model)

### FullAnnotate

We can check the results with the LightPipeline and `.fullAnnotate()`:

In [26]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

result = light_model.fullAnnotate(text)

In [27]:
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [28]:
result[0]['ner_chunk']

[Annotation(chunk, 121, 122, 60, {'entity': 'AGE', 'chunk': '0', 'sentence': '2'}, []),
 Annotation(chunk, 239, 249, 333-44-6666, {'entity': 'SSN', 'chunk': '1', 'sentence': '3'}, []),
 Annotation(chunk, 289, 302, (302) 786-5227, {'entity': 'PHONE', 'chunk': '2', 'sentence': '4'}, []),
 Annotation(chunk, 347, 361, smith@gmail.com, {'entity': 'EMAIL', 'chunk': '3', 'sentence': '4'}, [])]

In [29]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


df = pd.DataFrame({'ner_chunk':ner_chunk, 'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

|   | ner_chunk | begin | end | ner_label |
|---|---|---|---|---|
| 0 | 60 | 121 | 122 | AGE |
| 1 | 333-44-6666 | 239 | 249 | SSN |
| 2 | (302) 786-5227 | 289 | 302 | PHONE |
| 3 | smith@gmail.com | 347 | 361 | EMAIL |


### Transform

Let's transform our data on the model and get the results.

In [30]:
# This DataFrame holds the sample text, so it gets a descriptive name
text_df = spark.createDataFrame([[text]]).toDF("text")

result = regex_pipeline_model.transform(text_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|    ssn_matched_text|    age_matched_text|   mail_matched_text|  phone_matched_text|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Name : Hendrickso...|[{document, 0, 36...|[{document, 0, 60...|[{token, 0, 3, Na...|[{chunk, 239, 249...|[{chunk, 121, 122...|[{chunk, 347, 361...|[{chunk, 289, 302...|[{chunk, 121, 122...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [31]:
# .show() returns None, so the call is not assigned back to `result`
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+---------------+---------+
|      ner_chunk|ner_label|
+---------------+---------+
|             60|      AGE|
|    333-44-6666|      SSN|
| (302) 786-5227|    PHONE|
|smith@gmail.com|    EMAIL|
+---------------+---------+



### Pretrained Pipeline

After fitting the pipeline, we can save it locally and load it back with `PretrainedPipeline.from_disk()` whenever needed.

In [32]:
regex_pipeline_model.write().overwrite().save("regex_pipeline_model")

In [33]:
from sparknlp.pretrained import PretrainedPipeline

regex_pipeline_loaded = PretrainedPipeline.from_disk("regex_pipeline_model")

In [34]:
result = regex_pipeline_loaded.fullAnnotate(text)
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [35]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


df = pd.DataFrame({'ner_chunk':ner_chunk,
                   'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

|   | ner_chunk | begin | end | ner_label |
|---|---|---|---|---|
| 0 | 60 | 121 | 122 | AGE |
| 1 | 333-44-6666 | 239 | 249 | SSN |
| 2 | (302) 786-5227 | 289 | 302 | PHONE |
| 3 | smith@gmail.com | 347 | 361 | EMAIL |
