![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **PySpark Tutorial-8 Custom Annotators UDF and Light Pipelines**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/8.PySpark_CustomAnnotators_UDF_and_Lightpipelines.ipynb)

This notebook covers a few special Spark NLP features: writing a custom annotator, using LightPipelines, and processing annotation columns with UDFs.




### Install PySpark

In [None]:
# install PySpark
! pip install -q pyspark==3.2.0 spark-nlp

### Initializing Spark

In [None]:
import sparknlp

spark = sparknlp.start(spark32=True)
# params =>> gpu=False, spark23=False (start with spark 2.3)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.1
Apache Spark version: 3.2.0


In [None]:
#  DO NOT FORGET WHEN YOU'RE DONE => spark.stop()

In [None]:
from sparknlp.base import *
from sparknlp.annotation import Annotation
import pandas as pd
from sparknlp.functions import *
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# explicit imports, so the notebook does not rely on names re-exported by the star imports
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import Row

# Annotators and Transformer Concepts

In Spark NLP, all annotators are either Estimators or Transformers, as in Spark ML. An Estimator in Spark ML is an algorithm that can be fit on a DataFrame to produce a Transformer; for example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model. A Transformer is an algorithm that transforms one DataFrame into another; for example, an ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions.

Spark NLP has two types of annotators: AnnotatorApproach and AnnotatorModel. AnnotatorApproach extends Spark ML Estimators, which are meant to be trained through `fit()`, and AnnotatorModel extends Transformers, which are meant to transform DataFrames through `transform()`.

Some Spark NLP annotators have a Model suffix and some do not. The suffix is stated explicitly when the annotator is the result of a training process. Some annotators, such as SentenceDetector, are transformers but do not carry the Model suffix because they are not trained annotators. Model annotators expose a `pretrained()` method on their static object to retrieve the public pretrained version of a model.

Long story short: if it trains on a DataFrame and produces a model, it is an AnnotatorApproach; if it transforms one DataFrame into another through some model, it is an AnnotatorModel (e.g. WordEmbeddingsModel); and it does not take the Model suffix if it does not rely on a pretrained annotator while transforming a DataFrame (e.g. SentenceDetector).
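
The fit/transform split described above can be sketched in plain Python, without Spark. The class names below (`VocabularyApproach`, `VocabularyModel`, `WhitespaceSplitter`) are hypothetical stand-ins invented for illustration; they are not Spark NLP classes and only mimic the shape of the pattern.

```python
# Plain-Python sketch of the Estimator -> Transformer pattern (no Spark required).
# All class names are hypothetical stand-ins, not Spark NLP classes.

class VocabularyApproach:
    """Plays the role of an AnnotatorApproach (Estimator): it learns from data."""

    def fit(self, rows):
        vocabulary = set()
        for text in rows:
            vocabulary.update(text.lower().split())
        # fit() returns a new object that holds the learned state
        return VocabularyModel(vocabulary)


class VocabularyModel:
    """Plays the role of an AnnotatorModel (Transformer): it only transforms."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # each input row becomes an output row with an extra "known" column
        return [
            {"text": text,
             "known": [w for w in text.lower().split() if w in self.vocabulary]}
            for text in rows
        ]


class WhitespaceSplitter:
    """An untrained transformer: it has transform() but nothing to fit."""

    def transform(self, rows):
        return [{"text": text, "token": text.split()} for text in rows]


model = VocabularyApproach().fit(["Peter is good", "John is happy"])
print(model.transform(["Peter is happy today"]))
print(WhitespaceSplitter().transform(["Peter is happy today"]))
```

The trained object is the only one that carries state learned from data, which is the distinction the Model suffix signals.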

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/examples/python/annotation/text/english/spark-nlp-basics/sample-sentences-en.txt

In [None]:
with open('./sample-sentences-en.txt') as f:
    print(f.read())

Peter is a very good person.
My life in Russia is very interesting.
John and Peter are brothers. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!


In [None]:
spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



## Spark NLP Annotators

### Document Assembler

To get through the process in Spark NLP, we need to get raw data transformed into Document type at first.

DocumentAssembler() is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road.

In [None]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")\
      .setCleanupMode("shrink")

doc_df = documentAssembler.transform(spark_df)

doc_df.show(truncate=30)

+------------------------------+------------------------------+
|                          text|                      document|
+------------------------------+------------------------------+
|  Peter is a very good person.|[{document, 0, 27, Peter is...|
|My life in Russia is very i...|[{document, 0, 37, My life ...|
|John and Peter are brothers...|[{document, 0, 76, John and...|
|Lucas Nogal Dunbercker is n...|[{document, 0, 67, Lucas No...|
|Europe is very culture rich...|[{document, 0, 68, Europe i...|
+------------------------------+------------------------------+



In [None]:
doc_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
doc_df.select('document.result','document.begin','document.end').show(truncate=False)

+-------------------------------------------------------------------------------+-----+----+
|result                                                                         |begin|end |
+-------------------------------------------------------------------------------+-----+----+
|[Peter is a very good person.]                                                 |[0]  |[27]|
|[My life in Russia is very interesting.]                                       |[0]  |[37]|
|[John and Peter are brothers. However they don't support each other that much.]|[0]  |[76]|
|[Lucas Nogal Dunbercker is no longer happy. He has a good car though.]         |[0]  |[67]|
|[Europe is very culture rich. There are huge churches! and big houses!]        |[0]  |[68]|
+-------------------------------------------------------------------------------+-----+----+



In [None]:
doc_df.select("document.result").take(1)

[Row(result=['Peter is a very good person.'])]
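
To make the annotation fields in the schema above more concrete, here is a minimal plain-Python sketch of a document annotation. `SimpleAnnotation` and `to_document` are hypothetical names used only for illustration; the sketch assumes that `end` is an inclusive character offset, which is consistent with the `begin`/`end` values shown above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleAnnotation:
    """Hypothetical stand-in mirroring the fields printed by printSchema()."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)


def to_document(text):
    # assumption: `end` is inclusive, so a text of length n ends at n - 1
    return SimpleAnnotation("document", 0, len(text) - 1, text)


doc = to_document("Peter is a very good person.")
print(doc.annotator_type, doc.begin, doc.end)
```

For the first sample sentence this gives `begin=0` and `end=27`, the same offsets that appear in the `document` column.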

### Sentence Detector
SentenceDetector finds sentence boundaries in raw text.

`setCustomBounds(StringArray)`: custom sentence separators, e.g. `["\n"]`

In [None]:
from sparknlp.annotator import *

# we feed the document column coming from Document Assembler

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

In [None]:
sent_df = sentenceDetector.transform(doc_df)

sent_df.show(truncate=False)

+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                         |document                                                                                                               |sentences                                                                                                                                                                                          |
+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------

In [None]:
sent_df.select('sentences.result').take(5)

[Row(result=['Peter is a very good person.']),
 Row(result=['My life in Russia is very interesting.']),
 Row(result=['John and Peter are brothers.', "However they don't support each other that much."]),
 Row(result=['Lucas Nogal Dunbercker is no longer happy.', 'He has a good car though.']),
 Row(result=['Europe is very culture rich.', 'There are huge churches!', 'and big houses!'])]
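
As a rough illustration of what "sentence bounds" means, the sketch below splits text on `.`, `!` and `?` with a regular expression and reports inclusive offsets. This is a deliberately naive stand-in, not the algorithm SentenceDetector uses, and `naive_sentence_bounds` is a hypothetical helper name.

```python
import re


def naive_sentence_bounds(text):
    """Return (begin, end, sentence) triples with inclusive end offsets."""
    bounds = []
    # a candidate sentence: a run of non-terminators followed by its terminators
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # skip the leading whitespace left over from the previous sentence
        begin = match.start() + (len(raw) - len(raw.lstrip()))
        bounds.append((begin, begin + len(sentence) - 1, sentence))
    return bounds


for bound in naive_sentence_bounds(
        "Europe is very culture rich. There are huge churches! and big houses!"):
    print(bound)
```

On the sample sentences in this notebook the naive rule happens to produce the same splits as the output above; on text with abbreviations or decimal numbers it would not.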

### Tokenizer

Identifies tokens using open tokenization standards. It is an **Annotator Approach, so it requires .fit()**.

A few rules help customize it when the defaults do not fit your needs.

`setExceptions(StringArray)`: list of tokens that should not be altered at all. This allows composite tokens, such as two-word tokens, that the user may not want to split.

In [None]:
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

In [None]:
text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [None]:
doc_df = documentAssembler.transform(spark_df)

token_df = tokenizer.fit(doc_df).transform(doc_df)

token_df.show(truncate=100)

+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                           text|                                                                                            document|                                                                                               token|
+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!|[{document, 0, 78, Peter Parker (Spiderman) is a nice guy and lives

In [None]:
token_df.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New', 'York', 'but', 'has', 'no', 'e-mail', '!'])]
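
The effect of an exception list such as the one passed to `setExceptions` can be sketched with a small regex tokenizer in plain Python. `naive_tokenize` is a hypothetical helper and a simplification, not the Spark NLP Tokenizer; it only shows how protected phrases can be kept together as composite tokens.

```python
import re


def naive_tokenize(text, exceptions=()):
    """Split text into words and punctuation, keeping exception phrases whole."""
    # try longer exceptions first so they win over shorter overlapping ones
    alternatives = [re.escape(e) for e in sorted(exceptions, key=len, reverse=True)]
    # then hyphenated words, then single punctuation characters
    alternatives += [r"\w+(?:-\w+)*", r"[^\w\s]"]
    return re.findall("|".join(alternatives), text)


text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'
print(naive_tokenize(text))
print(naive_tokenize(text, exceptions=["New York"]))
```

Without exceptions, `New` and `York` come out as separate tokens, as in the output above; with `"New York"` listed as an exception they stay together as one token.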

### Perceptron Model

POS: part-of-speech tags

`PerceptronModel` is an averaged perceptron model for part-of-speech tagging. It assigns a POS tag to each word within a sentence.

This is the instantiated model of the PerceptronApproach. For training your own model, please see the documentation of that class.

In [None]:
pos = PerceptronModel.pretrained()\
    .setInputCols(['document', 'token'])\
    .setOutputCol('pos')

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


## Custom Annotator

### SentenceChecking

In [None]:
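class SentenceChecking(
    Transformer, HasInputCol, HasOutputCol,
    DefaultParamsReadable, DefaultParamsWritable):
    """Custom transformer that applies the function `f` to a column of annotations."""

    # default annotator type, used when none is passed to __init__
    output_annotation_type = "document"

    def __init__(self, f, output_annotation_type="document"):
        super(SentenceChecking, self).__init__()
        self.f = f
        # keep the requested type so _transform() writes it to the column metadata
        self.output_annotation_type = output_annotation_type

    def setInputCol(self, value):
        """
        Sets the value of :py:attr:`inputCol`.
        """
        return self._set(inputCol=value)

    def setOutputCol(self, value):
        """
        Sets the value of :py:attr:`outputCol`.
        """
        return self._set(outputCol=value)

    def _transform(self, dataset):
        # schema of the output column: an array of Spark NLP annotations
        t = Annotation.arrayType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]

        # map_annotations wraps self.f in a UDF that receives and returns Annotation objects
        return dataset.withColumn(out_col, map_annotations(self.f, t)(in_col).alias(out_col, metadata={
            'annotatorType': self.output_annotation_type}))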

In [None]:
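def checking_sentences(annotations):
  """Append a marker to every sentence and return the new list of annotations."""
  anns = []
  for a in annotations:
    result = a.result + " - CHECKED SENTENCE"
    # keep the original begin offset; the end offset is derived from the new, longer text
    anns.append(sparknlp.annotation.Annotation(a.annotator_type, a.begin, a.begin + len(result), result, a.metadata, a.embeddings))
  return anns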
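
The offset arithmetic in `checking_sentences` can be checked without Spark by running the same logic on a stand-in record. `FakeAnnotation` and `check_sentences_plain` are hypothetical names; the input offsets (0 to 45 and 47 to 91) are assumptions based on inclusive end offsets for the two sentences of the test string used in the pipeline below.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same field names that the function reads
# from sparknlp.annotation.Annotation, so the logic can run without Spark.
FakeAnnotation = namedtuple(
    "FakeAnnotation", "annotator_type begin end result metadata embeddings")


def check_sentences_plain(annotations):
    checked = []
    for a in annotations:
        result = a.result + " - CHECKED SENTENCE"
        # same arithmetic as checking_sentences: end = begin + len(result)
        checked.append(a._replace(end=a.begin + len(result), result=result))
    return checked


sentences = [
    FakeAnnotation("document", 0, 45,
                   "This is a sample text with multiple sentences.", {"sentence": "0"}, []),
    FakeAnnotation("document", 47, 91,
                   "It aims to show our custom annotator problem.", {"sentence": "1"}, []),
]

for a in check_sentences_plain(sentences):
    print(a.begin, a.end, a.result)
```

Note that `begin + len(result)` is one larger than an inclusive end offset would be (the built-in annotators report `end=27` for a 28-character document). If inclusive offsets matter downstream, `a.begin + len(result) - 1` may be the safer choice.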

## Creating Pipeline with Custom Annotator

In [None]:
document_assembler = DocumentAssembler()\
                    .setInputCol("text")\
                    .setOutputCol("document")

sentence_detector = SentenceDetector()\
                    .setInputCols(['document'])\
                    .setOutputCol('sentences')

sentence_checker = SentenceChecking(f=checking_sentences, output_annotation_type="document")\
                    .setInputCol("sentences")\
                    .setOutputCol("checked_sentences")

tokenizer = Tokenizer()\
                    .setInputCols(["checked_sentences"])\
                    .setOutputCol("tokens")

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            sentence_checker,
                            tokenizer
                           ])

In [None]:
test_string = "This is a sample text with multiple sentences. It aims to show our custom annotator problem."

test_data = spark.createDataFrame([[test_string]]).toDF("text")

In [None]:
%%time

fitted_pipeline = pipeline.fit(test_data)

spark_results = fitted_pipeline.transform(test_data)

CPU times: user 83.6 ms, sys: 12.1 ms, total: 95.7 ms
Wall time: 464 ms


In [None]:
%%time
spark_results.show(truncate=False)

+--------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
%%time

spark_results.select("checked_sentences").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|checked_sentences                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 65, This is a sample text with multiple sentences. - CHECKED SENTENCE, {sentence -> 0}, []}, {document, 47, 111, It aims to show our custom annotator problem. - CHECKED SENTENCE, {sentence -> 1}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------

In [None]:
spark_results.select("tokens.result").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This, is, a, sample, text, with, multiple, sentences, ., -, CHECKED, SENTENCE, It, aims, to, show, our, custom, annotator, problem, ., -, CHECKED, SENTENCE]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+



## LightPipeline

Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine, multi-threaded task, becoming more than **10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful when it comes to getting predictions for a few lines of text from a trained ML model. For more details, see the Medium post [Spark NLP 101: LightPipeline](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1).

In [None]:
document_assembler = DocumentAssembler()\
                    .setInputCol("text")\
                    .setOutputCol("document")

sentence_detector = SentenceDetector()\
                    .setInputCols(['document'])\
                    .setOutputCol('sentences')

tokenizer = Tokenizer()\
                    .setInputCols(["sentences"])\
                    .setOutputCol("token")

pos = PerceptronModel.pretrained()\
    .setInputCols(['document', 'token'])\
    .setOutputCol('pos')

pipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            pos
                           ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


**IMPORTANT!**  
Custom annotators cannot be used in LightPipelines.

In [None]:
from sparknlp.base import LightPipeline

light_model = LightPipeline(model)

In [None]:
light_result = light_model.annotate("John and Peter are brothers. However they don't support each other that much.")

In [None]:
list(zip(light_result['token'], light_result['pos']))

[('John', 'NNP'),
 ('and', 'CC'),
 ('Peter', 'NNP'),
 ('are', 'VBP'),
 ('brothers', 'NNS'),
 ('.', '.'),
 ('However', 'RB'),
 ('they', 'PRP'),
 ("don't", 'VBP'),
 ('support', 'VB'),
 ('each', 'DT'),
 ('other', 'JJ'),
 ('that', 'IN'),
 ('much', 'JJ'),
 ('.', '.')]
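
Because `annotate()` returns ordinary Python lists keyed by output column, the result can be post-processed without Spark. The sketch below hard-codes the `token` and `pos` lists shown above (the real result also contains the other output columns) and groups tokens by tag; `tokens_by_tag` is a hypothetical helper.

```python
from collections import defaultdict

# token and pos lists copied from the output above; other keys are omitted
light_result = {
    "token": ["John", "and", "Peter", "are", "brothers", ".", "However", "they",
              "don't", "support", "each", "other", "that", "much", "."],
    "pos": ["NNP", "CC", "NNP", "VBP", "NNS", ".", "RB", "PRP",
            "VBP", "VB", "DT", "JJ", "IN", "JJ", "."],
}


def tokens_by_tag(result, token_key="token", tag_key="pos"):
    """Group tokens by their POS tag, preserving the order of appearance."""
    grouped = defaultdict(list)
    for token, tag in zip(result[token_key], result[tag_key]):
        grouped[tag].append(token)
    return dict(grouped)


grouped = tokens_by_tag(light_result)
print(grouped["NNP"])
```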

-------------
# Spark NLP Annotation UDFs

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained()\
    .setInputCols(['document', 'token'])\
    .setOutputCol('pos')

pipeline = Pipeline().setStages([
    documentAssembler, 
    tokenizer, 
    pos])

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


In [None]:
data = spark.read.text('./sample-sentences-en.txt').toDF('text')

data.show(truncate = False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



In [None]:
model = pipeline.fit(data)

In [None]:
result = model.transform(data)

In [None]:
result.show(5)

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 pos|
+--------------------+--------------------+--------------------+--------------------+
|Peter is a very g...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|
|My life in Russia...|[{document, 0, 37...|[{token, 0, 1, My...|[{pos, 0, 1, PRP$...|
|John and Peter ar...|[{document, 0, 76...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|
|Lucas Nogal Dunbe...|[{document, 0, 67...|[{token, 0, 4, Lu...|[{pos, 0, 4, NNP,...|
|Europe is very cu...|[{document, 0, 68...|[{token, 0, 5, Eu...|[{pos, 0, 5, NNP,...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select('pos').show(1, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|pos                                                                                                                                                                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
@udf(ArrayType(StringType()))
def nn_annotation(res, meta):
    # collect the words whose POS tag is NN or NNP; the UDF returns a list of strings
    nn = []
    for i, j in zip(res, meta):
        if i == "NN" or i == "NNP":
            nn.append(j["word"])
    return nn

In [None]:
result.withColumn("nn & NNp tokens", nn_annotation(col("pos.result"), col("pos.metadata")))\
      .select("nn & NNp tokens")\
      .show(truncate=False)

+-------------------------------+
|nn & NNp tokens                |
+-------------------------------+
|[Peter, person]                |
|[life, Russia]                 |
|[John, Peter]                  |
|[Lucas, Nogal, Dunbercker, car]|
|[Europe]                       |
+-------------------------------+
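
The filtering logic inside the UDF is plain Python, so it can be tested on ordinary lists before it is registered with Spark. In the sketch below, `select_nouns` is a hypothetical helper, and the tag sequence for the first sample sentence is an illustrative assumption; only the first tag (`NNP`) and the expected result are visible in the outputs above.

```python
def select_nouns(results, metadata, wanted=("NN", "NNP")):
    """Return the words whose POS tag is in `wanted`.

    `results` is the list of tags (pos.result) and `metadata` the matching
    list of dicts (pos.metadata), each holding the token under "word".
    """
    return [meta["word"] for tag, meta in zip(results, metadata) if tag in wanted]


# illustrative tags for "Peter is a very good person." (assumed, not model output)
tags = ["NNP", "VBZ", "DT", "RB", "JJ", "NN", "."]
words = ["Peter", "is", "a", "very", "good", "person", "."]
metadata = [{"word": w} for w in words]

print(select_nouns(tags, metadata))
```

Once the function behaves as expected, it could be wrapped with `udf(select_nouns, ArrayType(StringType()))` and applied to `col("pos.result")` and `col("pos.metadata")` in the same way as `nn_annotation`.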

