![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/35.04.DocumentAssembler.ipynb)

# **DocumentAssembler**

This notebook covers the parameters and usage of `DocumentAssembler`. Spark NLP annotators work on annotation columns rather than on raw text, so raw text has to be converted first. `DocumentAssembler` is the special transformer that does this: it creates the first annotation, of type `DOCUMENT`, which annotators later in the pipeline can use.

**📖 Learning Objectives:**

1. Understand how to use `DocumentAssembler`.

2. Become comfortable using the different parameters of the `DocumentAssembler`.


**🔗 Helpful Links:**

- Documentation : [DocumentAssembler](https://nlp.johnsnowlabs.com/docs/en/annotators#documentassembler)

- Python Docs : [DocumentAssembler](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/document_assembler/index.html)

- Scala Docs : [DocumentAssembler](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/DocumentAssembler.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).

## **📜 Background**


In Spark NLP, transformations happen in the stages of a `Pipeline`, and each stage takes `annotations` as input and produces `annotations` as output.

`DocumentAssembler` transforms raw text into `DOCUMENT` annotations and is usually the first stage of a pipeline.

## **🎬 Colab Setup**

In [None]:
# Install PySpark and Spark NLP
!pip install -q pyspark==3.3.0  spark-nlp==4.3.2


In [None]:
import sparknlp
from sparknlp.base import DocumentAssembler


spark = sparknlp.start()

## **🖨️ Input/Output Annotation Types**

- Input: `None` (raw texts)

- Output: `DOCUMENT`

## **🔎 Parameters**


- `inputCol`: (String) The name of the column to convert. Only one column can be specified. It can be either a String column or an Array column.

- `outputCol`: (optional) The name of the generated `DOCUMENT` column. Only one column can be specified. Default is '**document**'.

- `idCol`: (optional) The name of a String column holding ID information.

- `metadataCol`: (optional) The name of a Map column holding metadata information.

- `cleanupMode`: (optional) How the text is cleaned up before the document is created. Default is '**disabled**'.


### `setInputCol()`


`setInputCol()` specifies which column of the input DataFrame is converted into the Document format used for further NLP processing. It accepts a single column, which can be either a String column or an Array column.

Suppose you have a DataFrame with two columns, "id" and "text", where "text" holds the raw text you want to process with Spark NLP. To convert "text" into Document format, create a `DocumentAssembler` and call `setInputCol("text")`.

In [None]:
data = [
    (1, "I love working with SparkNLP."),
    (2, "Today is sunny.")
]

# Create a DataFrame
columns = ["id", "text"]
df = spark.createDataFrame(data, columns)

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------+
|document                                                               |
+-----------------------------------------------------------------------+
|[{document, 0, 28, I love working with SparkNLP., {sentence -> 0}, []}]|
|[{document, 0, 14, Today is sunny., {sentence -> 0}, []}]              |
+-----------------------------------------------------------------------+
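
To make the fields in each output row easier to read, here is a minimal plain-Python sketch of one `document` entry. It does not call Spark NLP: the `DocumentAnnotation` class and the `to_document` helper are hypothetical names used only for illustration, and the field order follows the printed rows above (annotator type, begin, end, result, metadata, embeddings).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentAnnotation:
    """Illustrative stand-in for one entry of the `document` column."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def to_document(text: str) -> DocumentAnnotation:
    # `end` is the index of the last character, so it equals len(text) - 1.
    return DocumentAnnotation(
        annotatorType="document",
        begin=0,
        end=len(text) - 1,
        result=text,
        metadata={"sentence": "0"},
    )


doc = to_document("I love working with SparkNLP.")
print(doc.begin, doc.end, doc.result)  # 0 28 I love working with SparkNLP.
```

The sentence has 29 characters, which is why the row above reports `0, 28`: both offsets are inclusive character indices.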



### `setOutputCol()`


The outputCol parameter in the DocumentAssembler determines the name of the output column that will store the processed documents.

For example, suppose you have a dataset with a column named 'text', and you want to use the DocumentAssembler to process the text data.

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("processed_text")

In [None]:
result = document_assembler.transform(df)

result.select("processed_text").show(truncate=False)

+-----------------------------------------------------------------------+
|processed_text                                                         |
+-----------------------------------------------------------------------+
|[{document, 0, 28, I love working with SparkNLP., {sentence -> 0}, []}]|
|[{document, 0, 14, Today is sunny., {sentence -> 0}, []}]              |
+-----------------------------------------------------------------------+



In this example, the resulting DataFrame will have a new column named 'processed_text' containing the processed documents in the Document format, which can be used as input for further NLP tasks.

### `setIdCol()`


`setIdCol()` sets the name of the column that holds the row ID (documented as a String column), so that a unique identifier for each item in the dataset is carried into the annotation metadata.

Creating sample data with id column:

In [None]:
# define a schema for the dataset
schema = "id INT, text STRING"

data= [{"id": 0, "text": "The playful kittens chased the fluttering butterflies in the garden."},
       {"id": 1, "text": "During her vacation, Emily enjoyed playing tennis"}]

# create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data, schema)
df.show(truncate=False)

+---+--------------------------------------------------------------------+
|id |text                                                                |
+---+--------------------------------------------------------------------+
|0  |The playful kittens chased the fluttering butterflies in the garden.|
|1  |During her vacation, Emily enjoyed playing tennis                   |
+---+--------------------------------------------------------------------+



First, we define `DocumentAssembler()` without calling `setIdCol()` and check the metadata of the "document" column.

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")
    
result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------+
|document                                                                                                      |
+--------------------------------------------------------------------------------------------------------------+
|[{document, 0, 67, The playful kittens chased the fluttering butterflies in the garden., {sentence -> 0}, []}]|
|[{document, 0, 48, During her vacation, Emily enjoyed playing tennis, {sentence -> 0}, []}]                   |
+--------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("document.metadata").show(truncate=False)

+-----------------+
|metadata         |
+-----------------+
|[{sentence -> 0}]|
|[{sentence -> 0}]|
+-----------------+



As seen above, the metadata contains only the sentence number.

Now, let's call `setIdCol("id")` and see the difference.

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")\
    .setIdCol("id")

result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------+
|document                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 67, The playful kittens chased the fluttering butterflies in the garden., {id -> 0, sentence -> 0}, []}]|
|[{document, 0, 48, During her vacation, Emily enjoyed playing tennis, {id -> 1, sentence -> 0}, []}]                   |
+-----------------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("document.metadata").show(truncate=False)

+--------------------------+
|metadata                  |
+--------------------------+
|[{id -> 0, sentence -> 0}]|
|[{id -> 1, sentence -> 0}]|
+--------------------------+



As you can see above, the metadata now also contains the ID, because we called `setIdCol("id")`.

### `setMetadataCol()`


`setMetadataCol()` sets the name of a Map type column whose key-value pairs are copied into the metadata of each document.

`setIdCol()` adds only the ID to the metadata, while `setMetadataCol()` lets us attach any additional details.

Creating sample data with a MapType column containing metadata information: 

In [None]:
from pyspark.sql.types import MapType, StringType, IntegerType
from pyspark.sql import SparkSession

# define a schema for the dataset
schema = "id INT, name STRING, properties MAP<STRING, INT>"

# create a list of dictionaries to represent the data
data = [{"id": 1, "name": "Samantha", "properties": {"age": 28, "height": 165, "weight": 60}},
        {"id": 2, "name": "James", "properties": {"age": 32, "height": 185, "weight": 80}},
        {"id": 3, "name": "Olivia", "properties": {"age": 40, "height": 172, "weight": 65}}]

# create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data, schema)

# show the resulting DataFrame
df.show(truncate=False)

+---+--------+----------------------------------------+
|id |name    |properties                              |
+---+--------+----------------------------------------+
|1  |Samantha|{weight -> 60, age -> 28, height -> 165}|
|2  |James   |{weight -> 80, age -> 32, height -> 185}|
|3  |Olivia  |{weight -> 65, age -> 40, height -> 172}|
+---+--------+----------------------------------------+



Now, we will use `setMetadataCol("properties")` to specify metadata information for the "name" column. 

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("name")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")\
    .setMetadataCol("properties")

result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------+
|document                                                                                 |
+-----------------------------------------------------------------------------------------+
|[{document, 0, 7, Samantha, {weight -> 60, age -> 28, height -> 165, sentence -> 0}, []}]|
|[{document, 0, 4, James, {weight -> 80, age -> 32, height -> 185, sentence -> 0}, []}]   |
|[{document, 0, 5, Olivia, {weight -> 65, age -> 40, height -> 172, sentence -> 0}, []}]  |
+-----------------------------------------------------------------------------------------+



In [None]:
result.select("document.result").show(truncate=False)

+----------+
|result    |
+----------+
|[Samantha]|
|[James]   |
|[Olivia]  |
+----------+



In [None]:
result.select("document.metadata").show(truncate=False)

+---------------------------------------------------------+
|metadata                                                 |
+---------------------------------------------------------+
|[{weight -> 60, age -> 28, height -> 165, sentence -> 0}]|
|[{weight -> 80, age -> 32, height -> 185, sentence -> 0}]|
|[{weight -> 65, age -> 40, height -> 172, sentence -> 0}]|
+---------------------------------------------------------+
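
The `setIdCol()` and `setMetadataCol()` outputs suggest a simple mental model for how the metadata map is filled. The sketch below is plain Python and does not call Spark NLP; `build_metadata` is a hypothetical helper, the key names `id` and `sentence` are taken from the printed outputs, and using both columns at once is an assumption that this notebook does not demonstrate.

```python
from typing import Any, Dict, Optional


def build_metadata(row: Dict[str, Any],
                   id_col: Optional[str] = None,
                   metadata_col: Optional[str] = None) -> Dict[str, Any]:
    """Conceptual model of the metadata map seen in the outputs above."""
    metadata: Dict[str, Any] = {}
    if metadata_col is not None:
        # Copy every key-value pair of the Map column.
        metadata.update(row[metadata_col])
    if id_col is not None:
        # Store the row ID under the key "id".
        metadata["id"] = row[id_col]
    # The sentence number is always present.
    metadata["sentence"] = 0
    return metadata


row = {"id": 1, "name": "Samantha",
       "properties": {"age": 28, "height": 165, "weight": 60}}

print(build_metadata(row))                             # {'sentence': 0}
print(build_metadata(row, id_col="id"))                # {'id': 1, 'sentence': 0}
print(build_metadata(row, metadata_col="properties"))
# {'age': 28, 'height': 165, 'weight': 60, 'sentence': 0}
```

Values are kept as plain Python objects here for readability; the sketch makes no claim about how the library stores or orders them.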



### `setCleanupMode()`


`setCleanupMode()` pre-processes the text before the document annotation is created (default: `disabled`). It controls how noisy content such as blank lines and tabs is cleaned up.

Possible values for `setCleanupMode()`:
- **disabled**: Does not change the source text (default).
- **inplace**: Removes new lines and tabs.
- **inplace_full**: Removes new lines and tabs, including those that were converted to strings (e.g., \n, \r, \t).
- **shrink**: Removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
- **shrink_full**: Removes new lines and tabs, including stringified values, and merges multiple spaces and blank lines into a single space.
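
The modes listed above can be approximated in plain Python with regular expressions. This is a rough sketch for building intuition, not the library's implementation: the `cleanup` function and its patterns are assumptions derived from the descriptions in the list, and Spark NLP's own patterns may differ in edge cases.

```python
import re


def cleanup(text: str, mode: str = "disabled") -> str:
    """Approximate the cleanup modes with regular expressions (illustrative)."""
    if mode == "disabled":
        return text
    if mode == "inplace":
        # Replace each whitespace character (new line, tab, space) with a space.
        return re.sub(r"\s", " ", text)
    if mode == "inplace_full":
        # Also replace stringified sequences: a backslash followed by r, n or t.
        return re.sub(r"\s|\\r|\\n|\\t", " ", text)
    if mode == "shrink":
        # Trim the ends, then collapse every whitespace run into one space.
        return re.sub(r"\s+", " ", text.strip())
    if mode == "shrink_full":
        # Same as shrink, but stringified sequences count as whitespace too.
        return re.sub(r"(?:\s|\\r|\\n|\\t)+", " ", text.strip())
    raise ValueError(f"unknown cleanup mode: {mode}")


# A text with real new lines, and one with stringified (literal) sequences.
sample = ("I love working with  \n   SparkNLP. \n\n\n"
          "It is a perfect library.     I am living in Canada. \n")
stringified = r"first line\nsecond\tcolumn"

print(cleanup(sample, "shrink"))
# I love working with SparkNLP. It is a perfect library. I am living in Canada.
print(cleanup(stringified, "shrink") == stringified)  # True
print(cleanup(stringified, "shrink_full"))            # first line second column
```

The `_full` variants only behave differently when the text contains literal backslash sequences, as in the `stringified` example; on text with real new lines they match their plain counterparts.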

We will add new lines, blank lines, and extra spaces to our sample text in order to see how the pre-processing options work.

In [None]:
sample_texts= """I love working with  \n   SparkNLP. \n

It is a perfect library.     I am living in Canada. 
"""

data = spark.createDataFrame([[sample_texts]]).toDF("text")

**`disabled`**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("disabled")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------+
|document                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 90, I love working with  \n   SparkNLP. \n\n\nIt is a perfect library.     I am living in Canada. \n, {sentence -> 0}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("document.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[I love working with  \n   SparkNLP. \n\n\nIt is a perfect library.     I am living in Canada. \n]|
+--------------------------------------------------------------------------------------------------+



`disabled` is the default value of `setCleanupMode()`. As the result shows, nothing has changed: this option keeps the source text exactly as it was.

**`inplace`**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("inplace")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------+
|document                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 90, I love working with      SparkNLP.    It is a perfect library.     I am living in Canada.  , {sentence -> 0}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("document.result").show(truncate=False)

+---------------------------------------------------------------------------------------------+
|result                                                                                       |
+---------------------------------------------------------------------------------------------+
|[I love working with      SparkNLP.    It is a perfect library.     I am living in Canada.  ]|
+---------------------------------------------------------------------------------------------+



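The `inplace` option replaces new lines and tabs with spaces. As the result shows, there are no line breaks left, but runs of spaces remain and the text length is unchanged (the annotation still ends at index 90). `inplace_full` gives the same result on this sample, because the text contains no stringified sequences such as a literal `\n`.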

**`shrink`**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------+
|document                                                                                                               |
+-----------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 76, I love working with SparkNLP. It is a perfect library. I am living in Canada., {sentence -> 0}, []}]|
+-----------------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("document.result").show(truncate=False)

+-------------------------------------------------------------------------------+
|result                                                                         |
+-------------------------------------------------------------------------------+
|[I love working with SparkNLP. It is a perfect library. I am living in Canada.]|
+-------------------------------------------------------------------------------+



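As seen above, there are no new lines left, and runs of multiple spaces and blank lines have been merged into a single space, so the annotation now ends at index 76. `shrink_full` gives the same result on this sample, because the text contains no stringified sequences such as a literal `\n`.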