![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Finisher
In this notebook, we will examine the `Finisher` annotator and its parameters. <br/>

This annotator converts annotation results into a format that is easier to use. If you only need the final output column in the resulting DataFrame, `Finisher` can drop the intermediate annotation columns and return just that result. <br/>

**📖 Learning Objectives:**

1. Understand how to extract the results from Spark NLP Pipelines. 
2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

Documentation: [Finisher](https://nlp.johnsnowlabs.com/docs/en/annotators#finisher)

Python Docs: [Finisher](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/finisher/index.html#module-sparknlp.base.finisher)

Scala Docs: [Finisher](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/Finisher.html)

For an extended usage example, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb)


## **📜 Background**

Once an NLP pipeline is ready, we often want to use its annotation results elsewhere, in a form that is easy to work with. The `Finisher` outputs annotation values as a string or as an array of strings. This is handy when the output of a Spark NLP annotator serves as input to another Spark ML transformer.
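To make the idea concrete, the sketch below uses plain Python dictionaries as stand-ins for Spark NLP annotations. The field names mirror what the annotation columns print later in this notebook, but the dictionaries and the helper `results_only` are illustrative assumptions, not the library's API. It only shows why a flat list of strings is easier to hand to a downstream step than the full annotation structure.

```python
from collections import Counter

# Hypothetical stand-ins for token annotations; real annotations live in Spark columns.
annotations = [
    {"annotatorType": "token", "begin": 0, "end": 3, "result": "John", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 5, "end": 7, "result": "and", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 9, "end": 13, "result": "Peter", "metadata": {"sentence": "0"}},
    {"annotatorType": "token", "begin": 15, "end": 17, "result": "and", "metadata": {"sentence": "0"}},
]


def results_only(annots):
    """Keep just the `result` string of each annotation (illustrative helper)."""
    return [a["result"] for a in annots]


tokens = results_only(annotations)

# A downstream step (here a simple word count) can consume the flat list directly.
counts = Counter(tokens)
print(tokens)
print(counts.most_common(1))
```

The word count is only a placeholder for whatever consumes the finished column; in a real pipeline that would typically be another Spark ML stage.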

## **🎬 Colab Setup**

In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp==4.2.5


In [2]:
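import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
# Imported explicitly so the pipeline cells do not rely on wildcard imports
from pyspark.ml import Pipeline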

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `ANY` <br/>
- Output: `NONE`

## **🔎 Parameters** <br/>
- `CleanAnnotations` (*Boolean*): Sets whether to remove annotation columns, by default True.
- `IncludeMetadata` (*Boolean*): Sets whether to include annotation metadata, by default False.
- `OutputAsArray` (*Boolean*): Sets whether to generate an array with the results instead of a string, by default True.
- `AnnotationSplitSymbol` (*String*): Sets the character separating annotations, by default `@`.
- `ValueSplitSymbol` (*String*): Sets the character separating values, by default `#`.





### `setCleanAnnotations`

First, we will create a pipeline consisting of `DocumentAssembler`, `Tokenizer`, `Normalizer` and `Finisher` to examine each parameter. <br/>

The aim of this pipeline is to normalize the given text.

**`CleanAnnotations(True)`** <br/>
With this parameter set to True, we expect the result to contain only the `Finisher` output, not the output columns of the earlier stages.

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setCleanAnnotations(True) 
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])

In [4]:
# Create a sample DataFrame with a single text column
data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26"]]).toDF("text")

In [5]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

Let's look at the output.

In [6]:
result.show(truncate=False)

+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                              |finished_normalized                                                                                                           |
+------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
|John and Peter are brothers. However they don't support each other that much. John is 20 years old and Peter is 26|[John, and, Peter, are, brothers, However, they, dont, support, each, other, that, much, John, is, years, old, and, Peter, is]|
+-----------------------

Normally, we would expect to see the output columns of every pipeline stage (`DocumentAssembler`, `Tokenizer`, `Normalizer`). Here, only the `Finisher` output for the last stage (`Normalizer`) appears, because we used `Finisher` with **`CleanAnnotations(True)`**. <br/>

Also, as seen above, the `Finisher` output column is named "finished_" + *input column name*.

**`CleanAnnotations(False)`**

Setting `CleanAnnotations` to `False` keeps the output columns of all pipeline stages in the result.

In [7]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setCleanAnnotations(False) 
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])

In [8]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [9]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          normalized| finished_normalized|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|John and Peter ar...|[{document, 0, 11...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|[John, and, Peter...|
+--------------------+--------------------+--------------------+--------------------+--------------------+



As seen above, the result now contains the output columns of all pipeline stages.
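The two results can be summarized with a small plain-Python model of which columns end up in the output. This is only a sketch of the behavior observed in this notebook; the function `expected_columns` and its arguments are hypothetical names, not part of Spark NLP.

```python
def expected_columns(text_col, stage_output_cols, finisher_input_cols, clean_annotations=True):
    """Model which columns appear in the result (a sketch, not the library's logic)."""
    # Finisher names each output column "finished_" + input column name.
    finished = ["finished_" + c for c in finisher_input_cols]
    if clean_annotations:
        # Annotation columns are dropped; only the raw text and finished columns remain.
        return [text_col] + finished
    # Annotation columns are kept alongside the finished columns.
    return [text_col] + list(stage_output_cols) + finished


stages = ["document", "token", "normalized"]
print(expected_columns("text", stages, ["normalized"], clean_annotations=True))
print(expected_columns("text", stages, ["normalized"], clean_annotations=False))
```

The two printed lists correspond to the column headers of the two results shown in this section.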

### `setIncludeMetadata`

This parameter controls whether the annotation metadata is included in the output.

**IncludeMetadata(True)** <br/>
With this parameter set to True, we expect to see the metadata.

In [10]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setIncludeMetadata(True)
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])

In [11]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [12]:
result.show(truncate=70)

+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
|                                                                  text|                                                   finished_normalized|                                          finished_normalized_metadata|
+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
|John and Peter are brothers. However they don't support each other ...|[John, and, Peter, are, brothers, However, they, dont, support, eac...|[{sentence, 0}, {sentence, 0}, {sentence, 0}, {sentence, 0}, {sente...|
+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+

As seen above, we have a metadata column called "finished_normalized_metadata". 

**IncludeMetadata(False)** <br/>
With this parameter set to False, we do not expect to see the metadata.

In [13]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setIncludeMetadata(False) 
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])

In [14]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [15]:
result.show(truncate=70)

+----------------------------------------------------------------------+----------------------------------------------------------------------+
|                                                                  text|                                                   finished_normalized|
+----------------------------------------------------------------------+----------------------------------------------------------------------+
|John and Peter are brothers. However they don't support each other ...|[John, and, Peter, are, brothers, However, they, dont, support, eac...|
+----------------------------------------------------------------------+----------------------------------------------------------------------+



As we expected, there is no metadata in the result. 

### `setOutputAsArray`


This parameter controls whether the results are returned as an array or as a single string.

**OutputAsArray(True)** <br/>
With this parameter set to True, we expect to see an array with the results.

In [16]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputAsArray(True) 
    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])

In [17]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [18]:
result.show(truncate=50)

+--------------------------------------------------+--------------------------------------------------+
|                                              text|                               finished_normalized|
+--------------------------------------------------+--------------------------------------------------+
|John and Peter are brothers. However they don't...|[John, and, Peter, are, brothers, However, they...|
+--------------------------------------------------+--------------------------------------------------+



As seen above, the `Finisher` result is an array.

**OutputAsArray(False)** <br/>
With this parameter set to False, we expect to see the output as a string. <br/>

Annotations in the output string are separated by the `@` character by default. We can change that separator with the `AnnotationSplitSymbol` parameter, which is covered further below.

In [19]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputAsArray(False) 
        
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])


In [20]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [21]:
result.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                        text|                                         finished_normalized|
+------------------------------------------------------------+------------------------------------------------------------+
|John and Peter are brothers. However they don't support e...|John@and@Peter@are@brothers@However@they@dont@support@eac...|
+------------------------------------------------------------+------------------------------------------------------------+



As seen above, the result is a string instead of an array.
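If the string form needs to be turned back into tokens later, splitting on the separator is enough. The snippet below is a minimal plain-Python sketch using a shortened sample string (not the full notebook output) and assuming the separator never occurs inside a token.

```python
ANNOTATION_SPLIT = "@"  # separator seen in the string output above

# Shortened sample of a finished string; the real column value is longer.
finished_string = "John@and@Peter@are@brothers"

# Split the string back into individual tokens.
tokens = finished_string.split(ANNOTATION_SPLIT)

# Joining with the same separator reproduces the original string.
rejoined = ANNOTATION_SPLIT.join(tokens)

print(tokens)
print(rejoined == finished_string)
```

On a Spark DataFrame the same idea could be applied column-wise with `pyspark.sql.functions.split`, keeping in mind that it interprets the separator as a regular expression.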

### `setAnnotationSplitSymbol`

This parameter sets the character that separates annotations in the result.

**AnnotationSplitSymbol("%")** <br/>
The default character is `@`. Let's set it to `%` this time. <br/>
We set `OutputAsArray(False)` to get the result as a string.

In [22]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputAsArray(False) \
    .setAnnotationSplitSymbol("%")
        
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])


In [23]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [24]:
result.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                        text|                                         finished_normalized|
+------------------------------------------------------------+------------------------------------------------------------+
|John and Peter are brothers. However they don't support e...|John%and%Peter%are%brothers%However%they%dont%support%eac...|
+------------------------------------------------------------+------------------------------------------------------------+



As seen above, the annotations are now separated by the `%` character.
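One reason to change the separator is that a token may itself contain the separator character, which would make the string ambiguous to split later. The sketch below is a plain-Python illustration of choosing a separator that does not collide with any token. The helper `pick_split_symbol`, its candidate list, and the sample tokens are all hypothetical.

```python
def pick_split_symbol(tokens, candidates=("@", "%", "|", "\t")):
    """Return the first candidate that occurs in no token, or None if all collide."""
    for symbol in candidates:
        if not any(symbol in token for token in tokens):
            return symbol
    return None


# Hypothetical tokens that contain both "@" and "%".
tokens = ["mail", "john@example.com", "100%"]

symbol = pick_split_symbol(tokens)
joined = symbol.join(tokens)

print(repr(symbol))
print(joined)
# Splitting on the chosen symbol recovers the original tokens.
print(joined.split(symbol) == tokens)
```

The chosen symbol could then be passed to `setAnnotationSplitSymbol`. With the `Normalizer` used in this notebook such collisions did not appear in the output, so this matters mostly when finishing less heavily cleaned columns.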

### `setValueSplitSymbol`

This parameter sets the character that separates the values within each annotation when metadata is included.

**ValueSplitSymbol("^")** <br/>
The default character is `#`. Let's set it to `^` this time. <br/>
We set `IncludeMetadata(True)` in order to see the effect of the `ValueSplitSymbol` parameter.


In [25]:
finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputAsArray(False) \
    .setIncludeMetadata(True) \
    .setValueSplitSymbol("^")
        
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    finisher
])


In [26]:
# Fit the pipeline and transform the data
model = pipeline.fit(data)
result = model.transform(data)

In [27]:
result.show(truncate=60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                        text|                                         finished_normalized|
+------------------------------------------------------------+------------------------------------------------------------+
|John and Peter are brothers. However they don't support e...|sentence->0^result->John@sentence->0^result->and@sentence...|
+------------------------------------------------------------+------------------------------------------------------------+



As seen above, the values of each annotation are separated by the `^` character.
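A string in this format can be parsed back into structured records with two nested splits. The following is a minimal plain-Python sketch based only on the format visible in the output above (`key->value` pairs joined by `^`, annotations joined by `@`). The function `parse_finished` is a hypothetical helper, and it assumes that none of the separators occur inside a key or value.

```python
def parse_finished(value, annotation_split="@", value_split="^", key_value_sep="->"):
    """Parse a string like 'sentence->0^result->John@...' into a list of dicts."""
    parsed = []
    for annotation in value.split(annotation_split):
        fields = {}
        for pair in annotation.split(value_split):
            # partition splits on the first "->" only, keeping the rest as the value.
            key, _, val = pair.partition(key_value_sep)
            fields[key] = val
        parsed.append(fields)
    return parsed


# Beginning of the finished string shown in the output above.
sample = "sentence->0^result->John@sentence->0^result->and"

records = parse_finished(sample)
print(records)
print([r["result"] for r in records])
```

Passing different `annotation_split` and `value_split` arguments lets the same helper read strings produced with other separator settings.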