![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Documents_With_DocumentAssembler.ipynb)


# **Loading Documents with DocumentAssembler**

In these examples we look at ways to use the DocumentAssembler.

## **0. Colab Setup**

In [None]:
!pip install -q pyspark==3.3.0  spark-nlp==4.3.1

In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.2.8
Apache Spark version:  3.3.0


### **Create Spark Dataframe**

In [None]:
spark_df = spark.read.text('../spark-nlp-basics/sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+




DocumentAssembler() creates the first annotation, of type Document, which the annotators further down the pipeline can consume.

DocumentAssembler() comes from the sparknlp.base module and has the following settable parameters. The Spark NLP API documentation has the full list, and the source code is in the Spark NLP GitHub repository.


| Parameter  | Value | Description |
| - | - | - |
|**setInputCol()**       |String |The name of the column that will be converted. We can specify only one column here. It can read either a String column or an Array of Strings column.|
|**setOutputCol()** |optional|The name of the generated Document type column. We can specify only one column here. Default is '**document**'.|
|**setIdCol()**  |optional|String type column with id information.|
|**setMetadataCol()** |optional|Map type column with metadata information.|
|**setCleanupMode()** |optional|Cleanup options; see the possible values below.|


Possible values for setCleanupMode:
  ```
  disabled: keeps the source text as is. This is the default.
  inplace: removes new lines and tabs.
  inplace_full: removes new lines and tabs, including those that were converted to strings (i.e. the literal characters \n).
  shrink: removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
  shrink_full: removes new lines and tabs, including stringified values, and shrinks spaces and blank lines.
  ```
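
To make the differences between these modes concrete, here is a rough pure-Python approximation of the behaviours listed above. It is not Spark NLP's implementation: the name approx_cleanup is hypothetical, and both replacing removed characters with a single space and trimming the ends of the text in the shrink modes are assumptions made for illustration.

```python
import re

# Hypothetical helper that approximates the cleanup modes described above.
# Assumptions: "removed" characters are replaced by a space, and the shrink
# modes also trim leading/trailing whitespace. This is NOT the library's code.
def approx_cleanup(text: str, mode: str = "disabled") -> str:
    if mode == "disabled":
        # Source kept as original.
        return text
    if mode == "inplace":
        # Real new lines, carriage returns and tabs only.
        return re.sub(r"[\r\n\t]", " ", text)
    if mode == "inplace_full":
        # Also the stringified forms: a backslash followed by r, n or t.
        return re.sub(r"[\r\n\t]|\\r|\\n|\\t", " ", text)
    if mode == "shrink":
        # Collapse every run of whitespace into one space.
        return re.sub(r"\s+", " ", text.strip())
    if mode == "shrink_full":
        # Collapse runs of whitespace and stringified new lines/tabs together.
        return re.sub(r"(?:\s|\\r|\\n|\\t)+", " ", text.strip())
    raise ValueError(f"unknown cleanup mode: {mode}")


# Sample with a tab, a blank line, a double space and a stringified "\n".
sample = "a\tb\n\nc  d\\ne"
for mode in ["disabled", "inplace", "inplace_full", "shrink", "shrink_full"]:
    print(f"{mode:13s} {approx_cleanup(sample, mode)!r}")
```

Under these assumptions, only the two shrink modes collapse the blank line and the double space into a single space, and only the two _full modes touch the stringified new line.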

In [None]:
from sparknlp.base import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")

doc_df = documentAssembler.transform(spark_df)

doc_df.show(truncate=False)

+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|text                                                                         |document                                                                                                               |
+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
|Peter is a very good person.                                                 |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}]                                                 |
|My life in Russia is very interesting.                                       |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}]                                       |


First, we define DocumentAssembler with the desired parameters and then transform the data frame with it. The most important point here is that the column passed to .setInputCol() must be of type String or Array[String]. The column doesn’t have to be named text; you simply pass the column's actual name.

In [None]:
doc_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
doc_df.select('document.result','document.begin','document.end').show(truncate=False)

+-------------------------------------------------------------------------------+-----+----+
|result                                                                         |begin|end |
+-------------------------------------------------------------------------------+-----+----+
|[Peter is a very good person.]                                                 |[0]  |[27]|
|[My life in Russia is very interesting.]                                       |[0]  |[37]|
|[John and Peter are brothers. However they don't support each other that much.]|[0]  |[76]|
|[Lucas Nogal Dunbercker is no longer happy. He has a good car though.]         |[0]  |[67]|
|[Europe is very culture rich. There are huge churches! and big houses!]        |[0]  |[68]|
+-------------------------------------------------------------------------------+-----+----+



The new column is an array of structs with the fields shown above. All annotators and transformers share this universal structure, and its fields are filled in further down the pipeline depending on the annotators being used. Unless you plan to append other Spark NLP annotators after DocumentAssembler(), you don’t need to know what all these fields mean for now; we will cover them in the following articles. You can access any of these fields with {column name}.{field name}.
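
As a plain-Python illustration of this structure, the sketch below mirrors the printed schema with a hypothetical dataclass (SketchAnnotation and make_document are illustrative names, not Spark NLP classes). It assumes, based on the output above, that begin and end are character offsets into the text and that end is inclusive, i.e. the length of the text minus one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-in for one element of the "document" array column.
# Field names follow the schema printed by doc_df.printSchema().
@dataclass
class SketchAnnotation:
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


def make_document(text: str) -> SketchAnnotation:
    # Assumption: end is the inclusive index of the last character.
    return SketchAnnotation(
        annotatorType="document",
        begin=0,
        end=len(text) - 1,
        result=text,
        metadata={"sentence": "0"},
    )


doc = make_document("Peter is a very good person.")
print(doc.begin, doc.end)  # 0 27, matching the first row shown above
```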

Let’s print out the first item’s result.

In [None]:
doc_df.select("document.result").take(1)

[Row(result=['Peter is a very good person.'])]

To flatten the document column so that each annotation field becomes its own column, we can explode it as follows.


In [None]:
import pyspark.sql.functions as F

doc_df.withColumn(
    "tmp",
    F.explode("document"))\
    .select("tmp.*")\
    .show(truncate=False)

+-------------+-----+---+-----------------------------------------------------------------------------+---------------+----------+
|annotatorType|begin|end|result                                                                       |metadata       |embeddings|
+-------------+-----+---+-----------------------------------------------------------------------------+---------------+----------+
|document     |0    |27 |Peter is a very good person.                                                 |{sentence -> 0}|[]        |
|document     |0    |37 |My life in Russia is very interesting.                                       |{sentence -> 0}|[]        |
|document     |0    |76 |John and Peter are brothers. However they don't support each other that much.|{sentence -> 0}|[]        |
|document     |0    |67 |Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |{sentence -> 0}|[]        |
|document     |0    |68 |Europe is very culture rich. There are huge churches! and 