<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

## Simple Text Matching

In the following example, we walk through our straightforward Text Matcher annotator.

This annotator takes a list of phrases from a text file and looks each of them up in the given target dataset.

This annotator is an Annotator Model and hence does not require training. 
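
To make the idea of looking up phrases concrete, here is a minimal pure-Python sketch of exact phrase matching over a tokenized sentence. It is only an illustration of the concept and not Spark NLP's implementation; the function name `match_phrases` and the sample phrases are assumptions made for this example.

```python
def match_phrases(tokens, phrases):
    """Return each phrase that occurs as a contiguous run of tokens."""
    # Store every phrase as a tuple of its whitespace-separated tokens.
    phrase_tokens = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_tokens), default=0)

    matches = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tuple(tokens[start:start + length])
            # Skip windows that run past the end of the token list.
            if len(window) == length and window in phrase_tokens:
                matches.append(" ".join(window))
    return matches


# Hypothetical input, split on whitespace for simplicity.
tokens = "I love playing guitar with my band".split()
found = match_phrases(tokens, ["playing guitar", "band", "piano"])
print(found)
```

The sketch compares tokens exactly, so casing and punctuation matter; a real matcher may normalize them differently.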

### Spark `2.4` and Spark NLP `2.0.0`

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
import time

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *


# Set the location of the resource directory
resource_path = "../../../src/test/resources/"

#### 2. Load SparkSession if not already there

In [None]:
spark = sparknlp.start()

#### 3. Create the appropriate annotators. We use sentence detection and tokenization. The Finisher cleans the annotations and excludes the metadata.

In [3]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")

extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols(["token", "sentence"])\
  .setOutputCol("entites")

finisher = Finisher()\
  .setInputCols(["entites"])\
  .setIncludeMetadata(False)\
  .setCleanAnnotations(True)

pipeline = Pipeline(
  stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])
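
The matcher above reads its phrases from `entities.txt`, but the notebook does not show that file. The sketch below writes one, assuming the file holds one phrase per line; the phrases themselves and the temporary directory are hypothetical and chosen only for illustration.

```python
import os
import tempfile

# Hypothetical phrases; replace them with the entities you want to find.
phrases = ["guitar", "breaking news", "good morning"]

# Write the phrases to a file named entities.txt, one phrase per line (assumed format).
entities_path = os.path.join(tempfile.mkdtemp(), "entities.txt")
with open(entities_path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(phrases) + "\n")

# Read the file back to confirm its contents.
with open(entities_path, encoding="utf-8") as handle:
    loaded = [line.strip() for line in handle if line.strip()]
print(loaded)
```

The resulting path can then be passed to `setEntities` in place of the relative `"entities.txt"`.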


#### 4. Load the input data to be annotated

In [4]:
! rm -f /tmp/sentiment.parquet.zip
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip -P /tmp
! unzip -o /tmp/sentiment.parquet.zip -d /tmp/

--2019-03-21 18:12:44--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.162.13
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.162.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76127532 (73M) [application/zip]
Saving to: ‘/tmp/sentiment.parquet.zip’


2019-03-21 18:13:02 (4.08 MB/s) - ‘/tmp/sentiment.parquet.zip’ saved [76127532/76127532]

Archive:  /tmp/sentiment.parquet.zip
replace /tmp/sentiment.parquet/.part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [5]:
data = spark.read \
    .parquet("/tmp/sentiment.parquet") \
    .limit(1000) \
    .cache()
data.show(20)

+------+---------+--------------------+
|itemid|sentiment|                text|
+------+---------+--------------------+
|799033|        0|@FrankomQ8 What's...|
|799034|        1|@FranKoUK guitar ...|
|799035|        0|@frankparenteau u...|
|799036|        1|@frankparenteau w...|
|799037|        1|@FrankPatris dude...|
|799038|        0|@FrankRamblings a...|
|799039|        1|@frankroberts  ni...|
|799040|        0|@frankroberts ur ...|
|799041|        1|@FrankS Breaking ...|
|799042|        1|@frankschultelad ...|
|799043|        0|@frankshorter Wol...|
|799044|        0|@franksting - its...|
|799045|        1|@franksting Ha! D...|
|799046|        1|@franksting yeah,...|
|799047|        1|@franksting yes, ...|
|799048|        1|@FrankSylar arn't...|
|799049|        1|    @frankules WO ? |
|799050|        0|@frankwkelly I'm ...|
|799051|        1|@FrankXSalinas Th...|
|799052|        1|@frankybhoy93 tha...|
+------+---------+--------------------+
only showing top 20 rows



#### 5. Running the fit for sentence detection and tokenization.

In [6]:
print("Start fitting")
model = pipeline.fit(data)
print("Fitting is ended")

Start fitting
Fitting is ended


#### 6. Running the transform on the data to do text matching. It appends a new column with the matched entities.

In [7]:
extracted = model.transform(data)
extracted.show()

+------+---------+--------------------+----------------+
|itemid|sentiment|                text|finished_entites|
+------+---------+--------------------+----------------+
|799033|        0|@FrankomQ8 What's...|              []|
|799034|        1|@FranKoUK guitar ...|              []|
|799035|        0|@frankparenteau u...|              []|
|799036|        1|@frankparenteau w...|              []|
|799037|        1|@FrankPatris dude...|              []|
|799038|        0|@FrankRamblings a...|              []|
|799039|        1|@frankroberts  ni...|              []|
|799040|        0|@frankroberts ur ...|              []|
|799041|        1|@FrankS Breaking ...|              []|
|799042|        1|@frankschultelad ...|              []|
|799043|        0|@frankshorter Wol...|              []|
|799044|        0|@franksting - its...|              []|
|799045|        1|@franksting Ha! D...|              []|
|799046|        1|@franksting yeah,...|              []|
|799047|        1|@franksting y
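
The `finished_entites` column holds a plain array of strings per row, and `[]` presumably means that no phrase from the entities file was found in that row. As a rough pure-Python sketch of what the Finisher does conceptually, the following reduces annotation records to their `result` strings. The dictionaries are a simplified, hypothetical stand-in for Spark NLP annotations, and `finish` is a name invented for this example.

```python
def finish(annotations):
    """Keep only the matched text of each annotation, dropping offsets and metadata."""
    return [annotation["result"] for annotation in annotations]


# Hypothetical row with one match, and a row with no matches.
matched_row = [
    {"begin": 10, "end": 15, "result": "guitar", "metadata": {"sentence": "0"}},
]
unmatched_row = []

finished_matched = finish(matched_row)
finished_unmatched = finish(unmatched_row)
print(finished_matched)
print(finished_unmatched)
```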

#### 7. The model can be saved locally and reloaded to run again

In [8]:

model.write().overwrite().save("./extractor_model")

In [9]:
from pyspark.ml import PipelineModel

sameModel = PipelineModel.read().load("./extractor_model")

sameModel.transform(data).show()

+------+---------+--------------------+----------------+
|itemid|sentiment|                text|finished_entites|
+------+---------+--------------------+----------------+
|799033|        0|@FrankomQ8 What's...|              []|
|799034|        1|@FranKoUK guitar ...|              []|
|799035|        0|@frankparenteau u...|              []|
|799036|        1|@frankparenteau w...|              []|
|799037|        1|@FrankPatris dude...|              []|
|799038|        0|@FrankRamblings a...|              []|
|799039|        1|@frankroberts  ni...|              []|
|799040|        0|@frankroberts ur ...|              []|
|799041|        1|@FrankS Breaking ...|              []|
|799042|        1|@frankschultelad ...|              []|
|799043|        0|@frankshorter Wol...|              []|
|799044|        0|@franksting - its...|              []|
|799045|        1|@franksting Ha! D...|              []|
|799046|        1|@franksting yeah,...|              []|
|799047|        1|@franksting y