<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

## Simple Text Matching

In the following example, we walk through our straightforward TextMatcher annotator.

This annotator takes a list of phrases (entities) from a text file and looks them up in the given target dataset.

This annotator is an Annotator Model and hence does not require training. 
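To make the idea of phrase lookup concrete, here is a minimal plain-Python sketch. It is not Spark NLP's implementation; `match_phrases` is a hypothetical helper that only illustrates the concept of finding each phrase as a contiguous run of tokens, and it assumes whitespace tokenization and exact, case-sensitive comparison.

```python
from typing import List, Tuple


def match_phrases(tokens: List[str], phrases: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, phrase) for each phrase found as a contiguous token run.

    Hypothetical helper for illustration only; indices are token positions,
    and `end` is inclusive.
    """
    matches = []
    for phrase in phrases:
        target = phrase.split()  # assumption: phrases are whitespace-tokenized
        if not target:
            continue
        width = len(target)
        # Slide a window of `width` tokens over the sentence
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == target:
                matches.append((start, start + width - 1, phrase))
    return matches


tokens = "I love working with Spark NLP on big data".split()
phrases = ["Spark NLP", "big data", "deep learning"]
found = match_phrases(tokens, phrases)
print(found)
```

Only the phrases that actually occur in the token list are reported; "deep learning" is absent, so it produces no match.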

### Spark `2.4` and Spark NLP `1.8.3`

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
import os
import sys
sys.path.append('../../')

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
import time

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *


# Set the location of the resource directory
resource_path = "../../../src/test/resources/"
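Since later cells read files relative to `resource_path`, it can help to check that a file exists before handing its path to Spark. The sketch below is one possible way to do that; `resource_file` is a hypothetical helper, not part of Spark NLP, and the directory value is simply copied from the cell above.

```python
import os

# Same relative directory as in the notebook cell above
resource_path = "../../../src/test/resources/"


def resource_file(name: str) -> str:
    """Join a file name onto the resource directory (hypothetical helper)."""
    return os.path.join(resource_path, name)


parquet_path = resource_file("sentiment.parquet")
if not os.path.exists(parquet_path):
    # Report early instead of waiting for a Spark read error
    print("Resource not found:", parquet_path)
```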

#### 2. Load SparkSession if not already there

In [None]:
spark = SparkSession.builder \
    .appName("Text Matcher")\
    .master("local[*]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.3")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 3. Create the annotators. We use sentence detection and tokenization before matching. The Finisher cleans the annotations and excludes the metadata.

In [None]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["document"])\
  .setOutputCol("token")

extractor = TextMatcher()\
  .setEntities("entities.txt")\
  .setInputCols(["token", "sentence"])\
  .setOutputCol("entities")

finisher = Finisher() \
    .setInputCols(["entities"]) \
    .setIncludeMetadata(False) \
    .setCleanAnnotations(True)

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    extractor,
    finisher
  ])
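The notebook does not show the contents of `entities.txt`. Assuming the file holds one phrase per line, a file of that shape could be produced as sketched below. The phrases are made up for illustration, and the file is written to a temporary directory so that no real resource is overwritten.

```python
import os
import tempfile

# Hypothetical phrases, chosen only for illustration
phrases = ["I think", "not bad", "highly recommend"]

# Write one phrase per line (assumed layout of the entities file)
entities_path = os.path.join(tempfile.mkdtemp(), "entities.txt")
with open(entities_path, "w", encoding="utf-8") as f:
    f.write("\n".join(phrases) + "\n")

# Read the file back, skipping blank lines, to confirm the layout
with open(entities_path, encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]

print(loaded)
```

The resulting path could then be passed to `setEntities` in place of `"entities.txt"`.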


#### 4. Load the input data to be annotated

In [None]:
data = spark.read \
    .parquet(resource_path + "sentiment.parquet") \
    .limit(1000) \
    .cache()
data.show(20)

#### 5. Running the fit for sentence detection and tokenization.

In [None]:
print("Start fitting")
model = pipeline.fit(data)
print("Fitting has ended")

#### 6. Running the transform on the data to do text matching. It will append a new column with the matched entities.

In [None]:
extracted = model.transform(data)
extracted.show()

#### 7. The model could be saved locally and reloaded to run again

In [None]:

model.write().overwrite().save("./extractor_model")

In [None]:
# PipelineModel (not Pipeline) is the class that loads a fitted pipeline
from pyspark.ml import PipelineModel

sameModel = PipelineModel.read().load("./extractor_model")

sameModel.transform(data).show()