![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/regex-matcher/Matching_Text_with_RegexMatcher.ipynb)


# **Matching Text with RegexMatcher**

This notebook shows how to match text with Spark NLP's `RegexMatcher` annotator, using regex rules loaded from an external file.

## **0. Colab Setup**

In [None]:
!pip install -q pyspark==3.3.0 spark-nlp==4.3.1

In [None]:
import pyspark.sql.functions as F
from pyspark.ml import Pipeline

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.3.1
Apache Spark version:  3.3.0


### **Create Spark DataFrame**

In [None]:
spark_df = spark.read.text('../spark-nlp-basics/sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



## **RegexMatcher**

In [None]:
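# Download a sample of PubMed records (CSV with a header row).
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv

# Keep rows that have an abstract (AB), expose it as "text", and drop the title column (TI).
pubMedDF = spark.read\
              .option("header", "true")\
              .csv("./pubmed-sample.csv")\
              .filter("AB IS NOT null")\
              .withColumnRenamed("AB", "text")\
              .drop("TI")

pubMedDF.show(truncate=50)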

+--------------------------------------------------+
|                                              text|
+--------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|
|BACKGROUND: At present, it is one of the most i...|
|OBJECTIVE: To investigate the relationship betw...|
|Combined EEG/fMRI recording has been used to lo...|
|Kohlschutter syndrome is a rare neurodegenerati...|
|Statistical analysis of neuroimages is commonly...|
|The synthetic DOX-LNA conjugate was characteriz...|
|Our objective was to compare three different me...|
|We conducted a phase II study to assess the eff...|
|"Monomeric sarcosine oxidase (MSOX) is a flavoe...|
|We presented the tachinid fly Exorista japonica...|
|The literature dealing with the water conductin...|
|A novel approach to synthesize chitosan-O-isopr...|
|An HPLC-ESI-MS-MS method has been developed for...|
|The localizing and lateralizing values of eye a...|
|OBJECTIVE: To evaluate the effectiveness and 

In [None]:
# One rule per line: <regex><delimiter><description>; the delimiter used here is a comma.
rules = '''
renal\s\w+, started with 'renal'
cardiac\s\w+, started with 'cardiac'
\w*ly\b, ending with 'ly'
\S*\d+\S*, match any word that contains numbers
(\d+).?(\d*)\s*(mg|ml|g), match medication metrics
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)
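
One detail of the cell above is worth checking: `rules` is a regular (non-raw) Python string, and in a regular string `\b` is the backspace character rather than the regex word boundary. The sketch below uses only Python's `re` module (not Spark NLP, whose matching runs on the JVM and may treat the rule differently) to show the difference. The sentence in `sample` is made up for illustration and is not taken from the dataset.

```python
import re

# What a non-raw string yields for the third rule: '\b' is one backspace character (0x08).
plain_rule = '\\w*ly\b'
# Written as a raw string, '\b' stays a regex word boundary.
raw_rule = r'\w*ly\b'
# The same rule with no boundary at all, for comparison.
no_boundary_rule = r'\w*ly'

sample = "We analysed the family history only briefly."

has_backspace = '\x08' in plain_rule
plain_matches = re.findall(plain_rule, sample)          # needs a literal backspace in the text
raw_matches = re.findall(raw_rule, sample)              # whole words ending in 'ly'
no_boundary_matches = re.findall(no_boundary_rule, sample)  # also cuts words at an inner 'ly'

print(has_backspace)
print(plain_matches)
print(raw_matches)
print(no_boundary_matches)
```

If whole-word matching is what the rule intends, writing the rules as a raw string (`r'''...'''`) keeps `\b` intact. A rule without a working boundary also matches fragments such as `analy`.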

In [None]:
# Inspect the default parameter values of a fresh RegexMatcher.
RegexMatcher().extractParamMap()

{Param(parent='RegexMatcher_d761d66f2182', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='RegexMatcher_d761d66f2182', name='strategy', doc='MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE'): 'MATCH_ALL'}
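
The `strategy` parameter listed above accepts `MATCH_FIRST`, `MATCH_ALL` or `MATCH_COMPLETE`. As a rough analogy only, and assuming the names mean what they suggest (the first match only, every match, and a match only when the whole input satisfies the rule), the three behaviours can be imitated with Python's `re` module. The helper `apply_strategy` and the sample text are hypothetical and not part of Spark NLP.

```python
import re

def apply_strategy(pattern, text, strategy="MATCH_ALL"):
    """Hypothetical helper imitating the three strategy names with Python's re."""
    if strategy == "MATCH_FIRST":
        # Only the leftmost match, if any.
        m = re.search(pattern, text)
        return [m.group(0)] if m else []
    if strategy == "MATCH_ALL":
        # Every non-overlapping match, left to right.
        return [m.group(0) for m in re.finditer(pattern, text)]
    if strategy == "MATCH_COMPLETE":
        # A match only if the pattern covers the entire text.
        m = re.fullmatch(pattern, text)
        return [m.group(0)] if m else []
    raise ValueError(f"unknown strategy: {strategy}")

text = "renal failure after cardiac surgery and renal transplant"
pattern = r"renal\s\w+"

first_only = apply_strategy(pattern, text, "MATCH_FIRST")
all_matches = apply_strategy(pattern, text, "MATCH_ALL")
complete_long = apply_strategy(pattern, text, "MATCH_COMPLETE")
complete_short = apply_strategy(pattern, "renal failure", "MATCH_COMPLETE")

print(first_only)
print(all_matches)
print(complete_long)
print(complete_short)
```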

In [None]:
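# Wrap the raw "text" column into Spark NLP document annotations.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Apply every rule in regex_rules.txt to each document; "," separates a regex from its description.
regex_matcher = RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./regex_rules.txt', delimiter=',')

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        regex_matcher
        ])

# Fit and run the pipeline, then look at the matched strings of the first three rows.
match_df = nlpPipeline.fit(pubMedDF).transform(pubMedDF)
match_df.select('regex_matches.result').take(3)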

[Row(result=['inwardly', 'family', 'spansapproximately', 'byapproximately', 'approximately', 'respectively', 'poly', 'KCNJ9', '3.3,', 'GIRK3)', 'KCNJ9', '1q21-23', '7.6', '2.2', '2.6', 'identified14', 'aVal366Ala', '8', 'KCNJ9', 'KCNJ9', '9 g']),
 Row(result=['previously', 'previously', 'intravenously', 'previously', '25', 'mg/m(2)', '1', '8', 'a3', '50', '20.0%', '(10', '50;', '95%', 'interval,10.0-33.7%).', '58.0%', '[10', '18', '50].', '(50%', '115.0', '17.3%', '52).', '25 mg']),
 Row(result=['renal failure', 'cardiac surgery', 'cardiac surgery', 'cardiac surgical', 'early', 'statistically', 'analy', '1995', '2005', '=9796).', '2.9', '11years).', '11.3%', '1105),', '7.2%', '30%', '0.0001),', '1.55,95%', '1.42-1.70,', '0.0001).'])]
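
Each line of `regex_rules.txt` pairs a regex with a description, separated by the `delimiter` passed to `setExternalRules`. The sketch below imitates that layout in plain Python to show which rule could produce which match. It is not Spark NLP's implementation: the helpers `parse_rules` and `tag_matches` are hypothetical, splitting at the first delimiter is an assumption of this sketch, and the sample sentence is made up.

```python
import re

# Three of the rules from regex_rules.txt, copied as written (raw string keeps the backslashes).
rules_text = r'''
renal\s\w+, started with 'renal'
cardiac\s\w+, started with 'cardiac'
(\d+).?(\d*)\s*(mg|ml|g), match medication metrics
'''

def parse_rules(text, delimiter=","):
    """Split each non-empty line into (regex, description) at the first delimiter."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pattern, description = line.split(delimiter, 1)
        rules.append((pattern.strip(), description.strip()))
    return rules

def tag_matches(text, rules):
    """Return (matched string, rule description) pairs, rule by rule."""
    tagged = []
    for pattern, description in rules:
        for m in re.finditer(pattern, text):
            tagged.append((m.group(0), description))
    return tagged

rules = parse_rules(rules_text)
sample = "Patients with renal failure received 25 mg before cardiac surgery."
tagged = tag_matches(sample, rules)

for match, description in tagged:
    print(f"{match!r:20} <- {description}")
```

Because this sketch splits on the first comma, a regex that itself contains a comma (for example `\d{1,3}`) would be cut short; picking a delimiter that never appears in the rules avoids that problem here.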

In [None]:
# Show each abstract next to its matches, keeping only rows with more than one match.
match_df.select('text','regex_matches.result')\
        .toDF('text','matches').filter(F.size('matches')>1)\
        .show(truncate=70)

+----------------------------------------------------------------------+----------------------------------------------------------------------+
|                                                                  text|                                                               matches|
+----------------------------------------------------------------------+----------------------------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activ...|[inwardly, family, spansapproximately, byapproximately, approximate...|
|BACKGROUND: At present, it is one of the most important issues for ...|[previously, previously, intravenously, previously, 25, mg/m(2), 1,...|
|OBJECTIVE: To investigate the relationship between preoperative atr...|[renal failure, cardiac surgery, cardiac surgery, cardiac surgical,...|
|Combined EEG/fMRI recording has been used to localize the generator...|[normally, significantly, effectively, analy, only, considerably