![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **TokenAssembler**

This notebook will cover the different parameters and usages of `TokenAssembler`. 

**📖 Learning Objectives:**

1. Understand how it reconstructs a DOCUMENT type annotation from tokens.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [TokenAssembler](https://nlp.johnsnowlabs.com/docs/en/annotators#tokenassembler)

- Scala Docs : [TokenAssembler](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/TokenAssembler)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**

This transformer reconstructs a `DOCUMENT` type annotation from tokens, usually after they have been normalized, lemmatized, spell checked, etc., so that the resulting document annotation can be used by downstream annotators.

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.1.2 spark-nlp==4.2.4


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`,`TOKEN`

- Output: `DOCUMENT`

## **🔎 Parameters**

*  `preservePosition` (*Boolean*) : Whether to preserve the actual position of the tokens or to reduce the gaps between them to a single space (Default: `False`). It is set with `setPreservePosition()`.
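
The effect of this parameter can be modelled in plain Python. The sketch below is not Spark NLP's implementation; it is a simplified model that happens to reproduce the outputs shown later in this notebook. The helper names `clean_tokens` and `assemble`, the regex tokenizer, and the three-word stopword list are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny stopword list for this one sentence (not Spark NLP's list).
STOPWORDS = {"is", "an", "for"}

def clean_tokens(text):
    """Return (begin, end, word) triples; `end` is inclusive, offsets refer to `text`."""
    tokens = []
    for match in re.finditer(r"[A-Za-z'-]+", text):
        word = re.sub(r"[^A-Za-z]", "", match.group())  # crude normalization
        if word and word.lower() not in STOPWORDS:
            tokens.append((match.start(), match.end() - 1, word))
    return tokens

def assemble(text, tokens, preserve_position=False):
    """Join tokens into one string, optionally keeping the original whitespace gaps."""
    pieces = []
    last_end = None
    for begin, end, word in tokens:
        if last_end is not None:
            # Text that sat between the previous kept token and this one.
            gap = text[last_end + 1:begin]
            whitespace = "".join(re.findall(r"\s+", gap))
            pieces.append(whitespace if preserve_position and whitespace else " ")
        pieces.append(word)
        last_end = end
    return "".join(pieces)

text = ("Spark NLP is an open-source text processing library "
        "for advanced natural language processing.")
tokens = clean_tokens(text)
collapsed = assemble(text, tokens)
preserved = assemble(text, tokens, preserve_position=True)
print(repr(collapsed))
print(repr(preserved))
```

In this model, `preserve_position=True` keeps the whitespace that surrounded the removed tokens, which is why some gaps in the second string are wider than one space.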



In [3]:
# First, the text is tokenized and cleaned
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(False)

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
tokenAssembler = TokenAssembler() \
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("cleanText")

data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
    .toDF("text")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    tokenAssembler
]).fit(data)

result = pipeline.transform(data)
result.select("cleanText").take(1)


[Row(cleanText=[Row(annotatorType='document', begin=0, end=80, result='Spark NLP opensource text processing library advanced natural language processing', metadata={'sentence': '0'}, embeddings=[])])]

In [4]:
result.select('text', F.explode(result.cleanText.result).alias('cleanText')).show(truncate=False)

+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
|text                                                                                         |cleanText                                                                        |
+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
|Spark NLP is an open-source text processing library for advanced natural language processing.|Spark NLP opensource text processing library advanced natural language processing|
+---------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+



By default, `preservePosition` is `False`, so the actual position of the tokens is not preserved. The `TokenAssembler` simply joins the tokens that remain after normalization and stopword removal, separated by single spaces, into a `DOCUMENT` type annotation.


➤  PreservePosition : True


In [5]:
# Setting PreservePosition as True
tokenAssembler = TokenAssembler() \
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("cleanText")\
    .setPreservePosition(True)

data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
    .toDF("text")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    tokenAssembler
]).fit(data)

result = pipeline.transform(data)
result.select("cleanText").take(1)


[Row(cleanText=[Row(annotatorType='document', begin=0, end=83, result='Spark NLP   opensource text processing library  advanced natural language processing', metadata={'sentence': '0'}, embeddings=[])])]

In [6]:
result.select('text', F.explode(result.cleanText.result).alias('cleanText')).show(truncate=False)

+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|text                                                                                         |cleanText                                                                           |
+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
|Spark NLP is an open-source text processing library for advanced natural language processing.|Spark NLP   opensource text processing library  advanced natural language processing|
+---------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+



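With `preservePosition` set to `True`, the result shows that the actual position of the tokens has been preserved: there are three spaces between `NLP` and `opensource` (where `is an` was removed) and two spaces between `library` and `advanced` (where `for` was removed).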

### Checking for multiple sentences 


In [7]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/annotation/english/spark-nlp-basics/sample-sentences-en.txt

In [8]:
with open('./sample-sentences-en.txt') as f:
    print(f.read())

Peter is a very good person.
My life in Russia is very interesting.
John and Peter are brothers. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!


In [9]:
# Load the sample file, one row per line of text
spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+





➤  With Sentence Detector

In [10]:
sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")


# Setting PreservePosition as True
tokenAssembler = TokenAssembler() \
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("cleanText")\
    .setPreservePosition(True)


pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    tokenAssembler
]).fit(spark_df)

result = pipeline.transform(spark_df)
result.select("cleanText").take(5)


[Row(cleanText=[Row(annotatorType='document', begin=0, end=19, result='Peter    good person', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=25, result='life  Russia   interesting', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=20, result='John  Peter  brothers', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document', begin=29, end=57, result='However  dont support    much', metadata={'sentence': '1'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=36, result='Lucas Nogal Dunbercker   longer happy', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document', begin=43, end=57, result='good car though', metadata={'sentence': '1'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=20, result='Europe   culture rich', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document

In [11]:
result.select('text', F.explode(result.cleanText.result).alias('cleanText')).show(truncate=False)

+-----------------------------------------------------------------------------+-------------------------------------+
|text                                                                         |cleanText                            |
+-----------------------------------------------------------------------------+-------------------------------------+
|Peter is a very good person.                                                 |Peter    good person                 |
|My life in Russia is very interesting.                                       |life  Russia   interesting           |
|John and Peter are brothers. However they don't support each other that much.|John  Peter  brothers                |
|John and Peter are brothers. However they don't support each other that much.|However  dont support    much        |
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |Lucas Nogal Dunbercker   longer happy|
|Lucas Nogal Dunbercker is no longer happy. He has a goo

Because the pipeline includes a *`SentenceDetector`*, sentence boundaries are detected and we get a separate `cleanText` annotation for each sentence. The metadata holds the sentence number assigned to each annotation, starting at 0 for the first sentence. Because `preservePosition` is set to `True`, the actual position of the tokens has also been preserved.

In [12]:
import pyspark.sql.functions as F

result.select("text").show(truncate=False)

result.withColumn(
    "tmp", 
    F.explode("cleanText")) \
    .select("tmp.*").select("begin","end","result","metadata.sentence").show(truncate = False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+

+-----+---+-------------------------------------+--------+
|begin|end|result                               |sentence|
+-----+---+-------------------------------------+--------+
|0    |19 |Peter    good person                 |0       |
|0    |25 |life  Russia   interesting      

Sentence boundaries within the text were detected.

For example, the following text is split into two annotations, one per sentence:

*John and Peter are brothers. However they don't support each other that much.*

John  Peter  brothers  ▶  metadata={'sentence': '0'}

However  dont support    much ▶ metadata={'sentence': '1'}
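
The `begin` and `end` values in the output above can be checked against the `result` strings. In the rows shown, `end` equals `begin + len(result) - 1`, and `begin` matches the start offset of the sentence in the original text. The sketch below verifies that relation for values copied from the output above. It is an observation about this particular output, not a documented guarantee, and `expected_end` is a hypothetical helper.

```python
# (begin, end, result) copied from the cleanText annotations printed above.
rows = [
    (0, 20, "John  Peter  brothers"),
    (29, 57, "However  dont support    much"),
    (0, 36, "Lucas Nogal Dunbercker   longer happy"),
    (43, 57, "good car though"),
]

def expected_end(begin, result):
    """Inclusive end offset implied by the start offset and the result length."""
    return begin + len(result) - 1

checks = [expected_end(begin, result) == end for begin, end, result in rows]
print(checks)
```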




➤  Without Sentence Detector

In [13]:
# Without a SentenceDetector, the Tokenizer and the TokenAssembler read the document column instead of the sentences column.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("cleanText")\
    .setPreservePosition(True)

nlpPipeline = Pipeline(stages=[documentAssembler,
                               tokenizer,
                               normalizer,
                               stopwordsCleaner,
                               tokenassembler])

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select("cleanText").take(5)

[Row(cleanText=[Row(annotatorType='document', begin=0, end=19, result='Peter    good person', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=25, result='life  Russia   interesting', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=50, result='John  Peter  brothers However  dont support    much', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=55, result='Lucas Nogal Dunbercker   longer happy    good car though', metadata={'sentence': '0'}, embeddings=[])]),
 Row(cleanText=[Row(annotatorType='document', begin=0, end=48, result='Europe   culture rich   huge churches  big houses', metadata={'sentence': '0'}, embeddings=[])])]

In [14]:
result.select('text', F.explode(result.cleanText.result).alias('cleanText')).show(truncate=False)

+-----------------------------------------------------------------------------+--------------------------------------------------------+
|text                                                                         |cleanText                                               |
+-----------------------------------------------------------------------------+--------------------------------------------------------+
|Peter is a very good person.                                                 |Peter    good person                                    |
|My life in Russia is very interesting.                                       |life  Russia   interesting                              |
|John and Peter are brothers. However they don't support each other that much.|John  Peter  brothers However  dont support    much     |
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |Lucas Nogal Dunbercker   longer happy    good car though|
|Europe is very culture rich. There are h

Without a *`SentenceDetector`* in the pipeline, sentence boundaries are not detected, so each text is treated as a single sentence. Because `preservePosition` is set to `True`, the actual position of the tokens has again been preserved.

In [15]:
import pyspark.sql.functions as F

result.select("text").show(truncate=False)

result.withColumn(
    "tmp", 
    F.explode("cleanText")) \
    .select("tmp.*").select("begin","end","result","metadata.sentence").show(truncate = False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+

+-----+---+--------------------------------------------------------+--------+
|begin|end|result                                                  |sentence|
+-----+---+--------------------------------------------------------+--------+
|0    |19 |Peter    good person              

Sentence boundaries within the text were not detected, so each text was treated as a single sentence.
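
The difference between the two runs can be pictured as grouping the cleaned tokens under different spans: one span per detected sentence, or a single span for the whole document. The following plain-Python sketch illustrates that idea under stated assumptions. The function `assemble_per_span`, the regex-based tokenizer and sentence splitter, and the short stopword list are hypothetical stand-ins, not Spark NLP code, and tokens are joined with single spaces (as with `preservePosition` left at its default) to keep the example short.

```python
import re

# Hypothetical stopword list covering only this example (not Spark NLP's list).
STOPWORDS = {"and", "are", "they", "each", "other", "that"}

text = "John and Peter are brothers. However they don't support each other that much."

# Cleaned tokens as (begin, end, word); `end` is inclusive.
tokens = []
for m in re.finditer(r"[A-Za-z']+", text):
    word = m.group().replace("'", "")  # crude normalization
    if word.lower() not in STOPWORDS:
        tokens.append((m.start(), m.end() - 1, word))

def assemble_per_span(spans, tokens):
    """Build one result per span from the tokens that fall inside it."""
    docs = []
    for index, (span_begin, span_end) in enumerate(spans):
        words = [w for b, e, w in tokens if b >= span_begin and e <= span_end]
        docs.append({"sentence": str(index), "result": " ".join(words)})
    return docs

# Sentence spans from a naive regex split, standing in for a sentence detector.
sentence_spans = [(m.start(), m.end() - 1)
                  for m in re.finditer(r"\S[^.!?]*[.!?]", text)]
# A single span covering the whole text, as when no sentence detector is used.
document_span = [(0, len(text) - 1)]

with_sentences = assemble_per_span(sentence_spans, tokens)
without_sentences = assemble_per_span(document_span, tokens)
print(with_sentences)
print(without_sentences)
```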

`IMPORTANT NOTE`:

If later steps or annotators in your pipeline need to use tokens from the cleaned text (the assembled tokens), you will need to tokenize the processed text again, because it has probably changed substantially from the original text.
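
One way to see why re-tokenizing is needed: character offsets that were valid in the original text no longer line up with the assembled text. The sketch below uses only plain Python strings, with the assembled text copied from the default output earlier in this notebook. The variable names are illustrative, and the sketch makes no claim about how any particular annotator stores or uses offsets.

```python
original = ("Spark NLP is an open-source text processing library "
            "for advanced natural language processing.")
# Assembled text copied from the default (preservePosition=False) output above.
assembled = ("Spark NLP opensource text processing library "
             "advanced natural language processing")

# Offset of the token in the original text versus in the assembled text.
begin_in_original = original.index("open-source")
begin_in_assembled = assembled.index("opensource")

# Slicing the assembled text with the original offset no longer returns the token.
stale_slice = assembled[begin_in_original:begin_in_original + len("opensource")]
print(begin_in_original, begin_in_assembled, repr(stale_slice))
```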