![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Use pretrained `match_pattern` Pipeline

### Spark `2.4` and Spark NLP `2.0.0`

* DocumentAssembler
* SentenceDetector
* Tokenizer
* RegexMatcher (matches phone numbers)


In [1]:
import sys
sys.path.append('../../')

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [2]:
spark = SparkSession.builder \
    .appName("MatchPatternPipeline")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars", "/tmp/sparknlp.jar")\
    .config("spark.driver.extraClassPath", "/tmp/sparknlp.jar")\
    .config("spark.executor.extraClassPath", "/tmp/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

In [3]:
spark.version

'2.4.0'

This is our test DataFrame: a few sentences that contain phone numbers. We'll pass it to the pipeline's `transform` method to detect the phone numbers and extract them.

In [4]:
dfTest = spark.createDataFrame([
    "This is my phone number: +30648992549",
    "You should called Mr. Jon Doe at +33 1 79 01 22 89",
    "Ring me up dude! +1-334-179-1466"
], StringType()).toDF("text")

This pipeline can extract phone numbers in the following formats:
```
0689912549
+33698912549
+33 6 79 91 25 49
+33-6-79-91-25-49
(555)-555-5555
555-555-5555
+1-238 6 79 91 25 49
+1-555-532-3455
+15555323455
+7 06 79 91 25 49
```
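
To get a feel for what matching these formats involves, here is a minimal sketch in plain Python using the standard `re` module. The pattern `PHONE_PATTERN` and the helper `extract_phone_numbers` are hypothetical names written for illustration only; they are an assumption, not the rule shipped inside the pretrained `match_pattern` pipeline, which may be stricter or more permissive.

```python
import re

# Hypothetical pattern for illustration only. It is NOT the rule used by the
# pretrained pipeline. It accepts an optional "+", an optional "(", a digit,
# at least six digits/spaces/hyphens/parentheses, and a closing digit.
PHONE_PATTERN = re.compile(r"\+?\(?\d[\d \-()]{6,}\d")

# The formats listed above.
formats = [
    "0689912549",
    "+33698912549",
    "+33 6 79 91 25 49",
    "+33-6-79-91-25-49",
    "(555)-555-5555",
    "555-555-5555",
    "+1-238 6 79 91 25 49",
    "+1-555-532-3455",
    "+15555323455",
    "+7 06 79 91 25 49",
]

# The sentences from the test DataFrame.
sentences = [
    "This is my phone number: +30648992549",
    "You should called Mr. Jon Doe at +33 1 79 01 22 89",
    "Ring me up dude! +1-334-179-1466",
]


def extract_phone_numbers(text):
    """Return every substring of `text` that matches the sketch pattern."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


# Each listed format should be matched in full by the sketch pattern.
unmatched = [f for f in formats if PHONE_PATTERN.fullmatch(f) is None]
print("Formats not matched:", unmatched)

for sentence in sentences:
    print(extract_phone_numbers(sentence))
```

A pattern this loose will also match other long digit runs (order numbers, for example), so treat it as a starting point for experimentation rather than a validator.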

In [5]:
pipeline = PipelineModel.load("/tmp/match_pattern_en_1.8.0_2.4_1552649261389")

In [6]:
# You can select several columns at once, but showing one column per call keeps each annotator's results from being truncated
pipeline.transform(dfTest).select("token.result").show(truncate=False)
pipeline.transform(dfTest).select("regex.result").show(truncate=False)

# Print the schema of the DataFrame produced by the pipeline
pipeline.transform(dfTest).printSchema()

+--------------------------------------------------------------------+
|result                                                              |
+--------------------------------------------------------------------+
|[This, is, my, phone, number, :, +, 30648992549]                    |
|[You, should, called, Mr, ., Jon, Doe, at, +, 33, 1, 79, 01, 22, 89]|
|[Ring, me, up, dude, !, +, 1-334-179-1466]                          |
+--------------------------------------------------------------------+

+-------------------+
|result             |
+-------------------+
|[+30648992549]     |
|[+33 1 79 01 22 89]|
|[+1-334-179-1466]  |
+-------------------+

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metad