![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/token-assembler/Assembling_Tokens_to_Documents.ipynb)


# **Assembling Tokens to Documents**
This example shows how to use the TokenAssembler, which joins tokens back into document annotations.

## **0. Colab Setup**

In [None]:
!pip install -q pyspark==3.3.0  spark-nlp==4.3.1

In [None]:
import pyspark.sql.functions as F
from pyspark.ml import Pipeline

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.3.1
Apache Spark version:  3.3.0


### **Create Spark Dataframe**

In [None]:
spark_df = spark.read.text('../spark-nlp-basics/sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



## **Token Assembler**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

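# Clean each token (e.g. punctuation is removed: don't -> dont) but keep the original casing
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(False)

# Drop stop words, matching them case-insensitively
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)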

tokenassembler = TokenAssembler()\
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("clean_text")


nlpPipeline = Pipeline(stages=[documentAssembler,
                               sentenceDetector,
                               tokenizer,
                               normalizer,
                               stopwords_cleaner,
                               tokenassembler])

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|           sentences|               token|          normalized|         cleanTokens|          clean_text|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter is a very g...|[{document, 0, 27...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{token, 0, 4, Pe...|[{token, 0, 4, Pe...|[{document, 0, 16...|
|My life in Russia...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 1, My...|[{token, 0, 1, My...|[{token, 3, 6, li...|[{document, 0, 22...|
|John and Peter ar...|[{document, 0, 76...|[{document, 0, 27...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|[{document, 0, 18...|
|Lucas Nogal Dunbe...|[{document, 0, 67...|[{document, 0, 41...|[{token, 0, 4, Lu...|[{token, 0, 4, Lu...|

In [None]:
# With TokenAssembler().setPreservePosition(True), the original borders are preserved:
# dropped tokens and unwanted characters are replaced by spaces.
result.select('clean_text').take(1)

[Row(clean_text=[Row(annotatorType='document', begin=0, end=16, result='Peter good person', metadata={'sentence': '0'}, embeddings=[])])]
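The position-preserving behaviour described in the comment above can be pictured with a small plain-Python sketch. It does not call Spark NLP and is not the library's implementation; the helper name `assemble_preserving_position` and the hand-written token offsets are assumptions made for illustration only.

```python
# Plain-Python illustration (no Spark needed) of "preserve position":
# surviving tokens stay at their original offsets, everything else becomes a space.
# Assumption: this mirrors the idea in the comment above, not Spark NLP's exact output.

def assemble_preserving_position(text, kept_tokens):
    """kept_tokens: list of (begin, token) pairs, begin being an offset into text."""
    buffer = [" "] * len(text)
    for begin, token in kept_tokens:
        # Write the token back at the offset it had in the original text
        buffer[begin:begin + len(token)] = list(token)
    # Trailing spaces (e.g. where the final period was) are trimmed
    return "".join(buffer).rstrip()

text = "Peter is a very good person."
# Hypothetical surviving tokens after normalization and stop-word removal
kept_tokens = [(0, "Peter"), (16, "good"), (21, "person")]

padded = assemble_preserving_position(text, kept_tokens)
print(repr(padded))
```

In this sketch every kept word sits at the same character offset as in the original sentence, which is the property that preserving positions is meant to give.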

In [None]:
result.select('text', F.explode(result.clean_text.result).alias('clean_text')).show(truncate=False)

+-----------------------------------------------------------------------------+-----------------------------------+
|text                                                                         |clean_text                         |
+-----------------------------------------------------------------------------+-----------------------------------+
|Peter is a very good person.                                                 |Peter good person                  |
|My life in Russia is very interesting.                                       |life Russia interesting            |
|John and Peter are brothers. However they don't support each other that much.|John Peter brothers                |
|John and Peter are brothers. However they don't support each other that much.|However dont support much          |
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |Lucas Nogal Dunbercker longer happy|
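|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |good car though                    |
|Europe is very culture rich. There are huge churches! and big houses!        |Europe culture rich                |
|Europe is very culture rich. There are huge churches! and big houses!        |huge churches                      |
|Europe is very culture rich. There are huge churches! and big houses!        |big houses                         |
+-----------------------------------------------------------------------------+-----------------------------------+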

In [None]:
result.select('text', F.explode(result.clean_text.result).alias('clean_text')).toPandas()

Unnamed: 0,text,clean_text
0,Peter is a very good person.,Peter good person
1,My life in Russia is very interesting.,life Russia interesting
2,John and Peter are brothers. However they don'...,John Peter brothers
3,John and Peter are brothers. However they don'...,However dont support much
4,Lucas Nogal Dunbercker is no longer happy. He ...,Lucas Nogal Dunbercker longer happy
5,Lucas Nogal Dunbercker is no longer happy. He ...,good car though
6,Europe is very culture rich. There are huge ch...,Europe culture rich
7,Europe is very culture rich. There are huge ch...,huge churches
8,Europe is very culture rich. There are huge ch...,big houses


In [None]:
result.withColumn("tmp", F.explode("clean_text")) \
    .select("tmp.*") \
    .select("begin", "end", "result", "metadata.sentence") \
    .show(truncate=False)

+-----+---+-----------------------------------+--------+
|begin|end|result                             |sentence|
+-----+---+-----------------------------------+--------+
|0    |16 |Peter good person                  |0       |
|0    |22 |life Russia interesting            |0       |
|0    |18 |John Peter brothers                |0       |
|29   |53 |However dont support much          |1       |
|0    |34 |Lucas Nogal Dunbercker longer happy|0       |
|43   |57 |good car though                    |1       |
|0    |18 |Europe culture rich                |0       |
|29   |41 |huge churches                      |1       |
|54   |63 |big houses                         |2       |
+-----+---+-----------------------------------+--------+
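The `begin` and `end` values in the table above follow a simple pattern: `begin` is the offset where the sentence starts in the original text, and `end` is `begin + len(result) - 1`. The plain-Python sketch below reproduces two of the rows under that assumption. It runs without Spark, and the function name `assemble` and the input layout are hypothetical, not part of Spark NLP.

```python
# Plain-Python sketch of the offsets shown in the table above.
# Assumption (inferred from the output, not from the library source):
#   begin = start offset of the sentence in the original text
#   end   = begin + len(result) - 1

def assemble(sentences, clean_tokens):
    """sentences: list of (sentence_id, begin) pairs.
    clean_tokens: list of (sentence_id, token) pairs that survived cleaning."""
    assembled = []
    for sentence_id, begin in sentences:
        # Collect the surviving tokens of this sentence, in order
        words = [token for sid, token in clean_tokens if sid == sentence_id]
        result = " ".join(words)
        assembled.append({
            "begin": begin,
            "end": begin + len(result) - 1,
            "result": result,
            "sentence": sentence_id,
        })
    return assembled

# "John and Peter are brothers. However they don't support each other that much."
sentences = [(0, 0), (1, 29)]
clean_tokens = [
    (0, "John"), (0, "Peter"), (0, "brothers"),
    (1, "However"), (1, "dont"), (1, "support"), (1, "much"),
]

rows = assemble(sentences, clean_tokens)
for row in rows:
    print(row)
```

Because `end` is derived from the length of the cleaned string, it no longer points at the last character of the original sentence.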



In [None]:
# Without the SentenceDetector, each row is assembled into a single document.
# Here the Tokenizer and the TokenAssembler read the document column instead of sentences.

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenassembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("clean_text")

nlpPipeline = Pipeline(stages=[documentAssembler,
                               tokenizer,
                               normalizer,
                               stopwords_cleaner,
                               tokenassembler])

result = nlpPipeline.fit(spark_df).transform(spark_df)
result.select('text', 'clean_text.result').show(truncate=False)

+-----------------------------------------------------------------------------+-----------------------------------------------------+
|text                                                                         |result                                               |
+-----------------------------------------------------------------------------+-----------------------------------------------------+
|Peter is a very good person.                                                 |[Peter good person]                                  |
|My life in Russia is very interesting.                                       |[life Russia interesting]                            |
|John and Peter are brothers. However they don't support each other that much.|[John Peter brothers However dont support much]      |
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |[Lucas Nogal Dunbercker longer happy good car though]|
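|Europe is very culture rich. There are huge churches! and big houses!        |[Europe culture rich huge churches big houses]       |
+-----------------------------------------------------------------------------+-----------------------------------------------------+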

In [None]:
result.withColumn("tmp", F.explode("clean_text")) \
    .select("tmp.*") \
    .select("begin", "end", "result", "metadata.sentence") \
    .show(truncate=False)

+-----+---+---------------------------------------------------+--------+
|begin|end|result                                             |sentence|
+-----+---+---------------------------------------------------+--------+
|0    |16 |Peter good person                                  |0       |
|0    |22 |life Russia interesting                            |0       |
|0    |44 |John Peter brothers However dont support much      |0       |
|0    |50 |Lucas Nogal Dunbercker longer happy good car though|0       |
|0    |43 |Europe culture rich huge churches big houses       |0       |
+-----+---+---------------------------------------------------+--------+



**IMPORTANT NOTE:**

If later steps or annotators in your pipeline need tokens from the cleaned (assembled) text, tokenize that text again: its content, and therefore its character offsets, most likely no longer match the original text.
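A quick way to see why re-tokenizing is needed is to compare where the same words sit in the original and in the cleaned text. The sketch below is plain Python using the first sample sentence and its cleaned form from this notebook; it only illustrates the offset drift and does not run a Spark NLP pipeline.

```python
# Compare character offsets of the same words before and after cleaning.
original = "Peter is a very good person."
cleaned = "Peter good person"  # assembled text from the first row of the results

# For each surviving word: (offset in the original text, offset in the cleaned text)
offsets = {
    word: (original.find(word), cleaned.find(word))
    for word in cleaned.split()
}

for word, (old_begin, new_begin) in offsets.items():
    print(f"{word}: original offset {old_begin}, cleaned offset {new_begin}")
```

Only the first word keeps its offset. Token annotations computed on the original text would therefore point at the wrong characters in the cleaned text.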