![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/text-matcher-pipeline/extractor.ipynb)

## 0. Colab Setup

In [1]:
# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-23 14:46:26--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-12-23 14:46:26--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-23 14:46:26--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

## Simple Text Matching

In the following example, we walk-through our straight forward Text Matcher Annotator.

This annotator will take a list of sentences from a text file and look them up in the given target dataset.

This annotator is an Annotator Model and hence does not require training. 

#### 1. Call necessary imports and set the resource path to read local data files

In [2]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
import time

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

! wget -N https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/text-matcher-pipeline/entities.txt 

--2022-12-23 14:47:22--  https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/text-matcher-pipeline/entities.txt
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘entities.txt’

entities.txt            [ <=>                ] 149.13K  --.-KB/s    in 0.04s   

Last-modified header missing -- time-stamps turned off.
2022-12-23 14:47:22 (3.83 MB/s) - ‘entities.txt’ saved [152712]



In [3]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  4.2.6
Apache Spark version:  3.2.3


#### 3. Create appropriate annotators. We are using Sentence Detection and Tokenizing the sentence. The Finisher will clean the annotations and exclude the metadata.

In [4]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")

extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols(["token", "sentence"])\
  .setOutputCol("entites")

finisher = Finisher() \
    .setInputCols(["entites"]) \
    .setIncludeMetadata(False) \
    .setCleanAnnotations(True)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])


#### 4. Load the input data to be annotated

In [9]:
! rm /tmp/sentiment.parquet.zip
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip 
! unzip sentiment.parquet.zip 

rm: cannot remove '/tmp/sentiment.parquet.zip': No such file or directory
--2022-12-23 14:53:58--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.134.229, 52.216.37.120, 52.216.226.235, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.134.229|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘sentiment.parquet.zip’ not modified on server. Omitting download.

Archive:  sentiment.parquet.zip
   creating: sentiment.parquet/
  inflating: sentiment.parquet/.part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: sentiment.parquet/part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: sentiment.parquet/part-00003-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: sentiment.parquet/.part-00000-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: sentiment.parquet/part-0

In [11]:
data = spark. \
        read. \
        parquet("/content/sentiment.parquet"). \
        limit(1000).cache()
data.show(20)

+------+---------+--------------------+
|itemid|sentiment|                text|
+------+---------+--------------------+
|     1|        0|                 ...|
|     2|        0|                 ...|
|     3|        1|              omg...|
|     4|        0|          .. Omga...|
|     5|        0|         i think ...|
|     6|        0|         or i jus...|
|     7|        1|       Juuuuuuuuu...|
|     8|        0|       Sunny Agai...|
|     9|        1|      handed in m...|
|    10|        1|      hmmmm.... i...|
|    11|        0|      I must thin...|
|    12|        1|      thanks to a...|
|    13|        0|      this weeken...|
|    14|        0|     jb isnt show...|
|    15|        0|     ok thats it ...|
|    16|        0|    &lt;-------- ...|
|    17|        0|    awhhe man.......|
|    18|        1|    Feeling stran...|
|    19|        0|    HUGE roll of ...|
|    20|        0|    I just cut my...|
+------+---------+--------------------+
only showing top 20 rows



#### 5. Running the fit for sentence detection and tokenization.

In [12]:
print("Start fitting")
model = pipeline.fit(data)
print("Fitting is ended")

Start fitting
Fitting is ended


#### 6. Runing the transform on data to do text matching. It will append a new coloumns with matched entities.

In [13]:
extracted = model.transform(data)
extracted.show()

# filter rows with extracted text
extracted\
.filter("size(finished_entites) != 0") \
.show()

+------+---------+--------------------+----------------+
|itemid|sentiment|                text|finished_entites|
+------+---------+--------------------+----------------+
|     1|        0|                 ...|              []|
|     2|        0|                 ...|              []|
|     3|        1|              omg...|              []|
|     4|        0|          .. Omga...|              []|
|     5|        0|         i think ...|              []|
|     6|        0|         or i jus...|              []|
|     7|        1|       Juuuuuuuuu...|              []|
|     8|        0|       Sunny Agai...|              []|
|     9|        1|      handed in m...|              []|
|    10|        1|      hmmmm.... i...|              []|
|    11|        0|      I must thin...|              []|
|    12|        1|      thanks to a...|              []|
|    13|        0|      this weeken...|              []|
|    14|        0|     jb isnt show...|              []|
|    15|        0|     ok thats

#### 7. The model could be saved locally and reloaded to run again

In [14]:

model.write().overwrite().save("./extractor_model")

In [15]:
from pyspark.ml import  Pipeline

sameModel = PipelineModel.read().load("./extractor_model")

sameModel.transform(data) \
.filter("size(finished_entites) != 0") \
.show()

+------+---------+----+----------------+
|itemid|sentiment|text|finished_entites|
+------+---------+----+----------------+
+------+---------+----+----------------+

