![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/text-matcher-pipeline/extractor.ipynb)

# Simple Text Matching

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In the following example, we walk through our straightforward Text Matcher annotator.

This annotator takes a list of phrases from a text file and looks them up in the given target dataset.

This annotator is an Annotator Model and hence does not require training. 
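
To make the lookup idea concrete, here is a minimal plain-Python sketch of phrase matching over tokenized text. It is a conceptual illustration only, not how Spark NLP implements the Text Matcher. The helper names `tokenize` and `match_phrases` and the sample sentence are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
# Conceptual sketch only: this is NOT Spark NLP's TextMatcher implementation.
# `tokenize` and `match_phrases` are hypothetical helpers for illustration.

def tokenize(text):
    """Split on whitespace; a simple stand-in for a real tokenizer."""
    return text.split()


def match_phrases(text, phrases):
    """Return every phrase whose tokens appear consecutively in the text."""
    tokens = tokenize(text)
    phrase_tokens = [tokenize(p) for p in phrases]
    matches = []
    for start in range(len(tokens)):
        for candidate in phrase_tokens:
            end = start + len(candidate)
            # Compare the token window against the phrase, exact and case-sensitive.
            if candidate and tokens[start:end] == candidate:
                matches.append(" ".join(candidate))
    return matches


# The sample sentence is made up for this sketch; it is not from the dataset.
phrases = ["guitar lessons", "i think"]
print(match_phrases("looking for guitar lessons this summer", phrases))
```

The sketch compares each window of tokens against each phrase, which is enough to show why the annotator needs both tokens and a phrase list as input.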

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
import time

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

#### 2. Start the Spark session

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  4.3.1
Apache Spark version:  3.0.2


#### 3. Create appropriate annotators. We use sentence detection and tokenization. The Finisher cleans the annotations and excludes the metadata.

In [None]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")

extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols(["token", "sentence"])\
  .setOutputCol("entites")

finisher = Finisher() \
    .setInputCols(["entites"]) \
    .setIncludeMetadata(False) \
    .setCleanAnnotations(True)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])
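
The pipeline above reads its phrases from `entities.txt`, so that file has to exist before `fit` is called. The sketch below writes a small phrase file, assuming a plain-text layout with one phrase per line. The two phrases are the ones that appear in the matches shown in this notebook's output; the real `entities.txt` may contain more. The file name `entities_example.txt` and the helper `write_entities_file` are hypothetical, chosen so that nothing existing is overwritten.

```python
from pathlib import Path

# Hypothetical phrase list: these two phrases are taken from the matches shown
# in this notebook's output; the real entities.txt may contain more.
example_phrases = ["guitar lessons", "i think"]


def write_entities_file(path, phrases):
    """Write one phrase per line (the plain-text layout assumed in this sketch)."""
    target = Path(path)
    target.write_text("\n".join(phrases) + "\n", encoding="utf-8")
    return target


# Use a separate file name so an existing entities.txt is left untouched.
entities_path = write_entities_file("entities_example.txt", example_phrases)
print(entities_path.read_text(encoding="utf-8"))
```

To try it with the pipeline, the path passed to `setEntities` would point at this file instead.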


#### 4. Load the input data to be annotated

In [None]:
! rm /tmp/sentiment.parquet.zip
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip 
! unzip sentiment.parquet.zip 

rm: cannot remove '/tmp/sentiment.parquet.zip': No such file or directory
--2023-02-20 11:50:49--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.60.24, 52.216.220.80, 54.231.233.48, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.60.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76127532 (73M) [application/zip]
Saving to: ‘sentiment.parquet.zip’


2023-02-20 11:50:54 (19,6 MB/s) - ‘sentiment.parquet.zip’ saved [76127532/76127532]

Archive:  sentiment.parquet.zip
   creating: sentiment.parquet/
  inflating: sentiment.parquet/.part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: sentiment.parquet/part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: sentiment.parquet/part-00003-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  in

In [None]:
data = spark. \
        read. \
        parquet("sentiment.parquet"). \
        limit(1000).cache()
data.show(20)

+------+---------+--------------------+
|itemid|sentiment|                text|
+------+---------+--------------------+
|799033|        0|@FrankomQ8 What's...|
|799034|        1|@FranKoUK guitar ...|
|799035|        0|@frankparenteau u...|
|799036|        1|@frankparenteau w...|
|799037|        1|@FrankPatris dude...|
|799038|        0|@FrankRamblings a...|
|799039|        1|@frankroberts  ni...|
|799040|        0|@frankroberts ur ...|
|799041|        1|@FrankS Breaking ...|
|799042|        1|@frankschultelad ...|
|799043|        0|@frankshorter Wol...|
|799044|        0|@franksting - its...|
|799045|        1|@franksting Ha! D...|
|799046|        1|@franksting yeah,...|
|799047|        1|@franksting yes, ...|
|799048|        1|@FrankSylar arn't...|
|799049|        1|    @frankules WO ? |
|799050|        0|@frankwkelly I'm ...|
|799051|        1|@FrankXSalinas Th...|
|799052|        1|@frankybhoy93 tha...|
+------+---------+--------------------+
only showing top 20 rows



#### 5. Running the fit for sentence detection and tokenization.

In [None]:
print("Start fitting")
model = pipeline.fit(data)
print("Fitting is ended")

Start fitting
Fitting is ended


#### 6. Running the transform on data to do text matching. It will append a new column with the matched entities.

In [None]:
extracted = model.transform(data)
extracted.show()

# filter rows with extracted text
extracted\
.filter("size(finished_entites) != 0") \
.show()

+------+---------+--------------------+----------------+
|itemid|sentiment|                text|finished_entites|
+------+---------+--------------------+----------------+
|799033|        0|@FrankomQ8 What's...|              []|
|799034|        1|@FranKoUK guitar ...|[guitar lessons]|
|799035|        0|@frankparenteau u...|              []|
|799036|        1|@frankparenteau w...|              []|
|799037|        1|@FrankPatris dude...|              []|
|799038|        0|@FrankRamblings a...|              []|
|799039|        1|@frankroberts  ni...|              []|
|799040|        0|@frankroberts ur ...|              []|
|799041|        1|@FrankS Breaking ...|              []|
|799042|        1|@frankschultelad ...|              []|
|799043|        0|@frankshorter Wol...|              []|
|799044|        0|@franksting - its...|              []|
|799045|        1|@franksting Ha! D...|              []|
|799046|        1|@franksting yeah,...|              []|
|799047|        1|@franksting y

#### 7. The model can be saved locally and reloaded to run again

In [None]:

model.write().overwrite().save("./extractor_model")

In [None]:
from pyspark.ml import PipelineModel

sameModel = PipelineModel.read().load("./extractor_model")

sameModel.transform(data) \
.filter("size(finished_entites) != 0") \
.show()

+------+---------+--------------------+----------------+
|itemid|sentiment|                text|finished_entites|
+------+---------+--------------------+----------------+
|799034|        1|@FranKoUK guitar ...|[guitar lessons]|
|799065|        0|@Frannyd oh lame....|       [i think]|
|799173|        1|i am seriously sl...|       [i think]|
|799869|        1|@FrazzleYeah  yea...|       [i think]|
|799898|        0|@freakyfudge that...|       [i think]|
|799957|        1|@FreddyMallet Hi!...|       [i think]|
|800003|        0|@freecloud but i ...|       [i think]|
+------+---------+--------------------+----------------+

