![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **MultiDocumentAssembler**

This notebook will cover the different parameters and usages of `MultiDocumentAssembler`. 

**📖 Learning Objectives:**

1. Understand how to prepare data in a format that Spark NLP can process

2. Become comfortable with how to apply some text pre-processing by using the parameters of `MultiDocumentAssembler` 

3. Be able to use this annotator in a question answering application


**🔗 Helpful Links:**

Documentation: [MultiDocumentAssembler](https://nlp.johnsnowlabs.com/docs/en/annotators#multidocumentassembler)

Python Docs: [MultiDocumentAssembler](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/multi_document_assembler/index.html#sparknlp.base.multi_document_assembler.MultiDocumentAssembler.setIdCol)

Scala Docs: [MultiDocumentAssembler](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/MultiDocumentAssembler.html)

Example Use Case: [Table Question Answering](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/17.Table_Question_Answering.ipynb)

## **📜 Background**

This annotator is used as the first stage of a Spark NLP pipeline. It transforms raw text into `DOCUMENT` type annotations that are used by other annotators.

`DocumentAssembler()` is another annotator that transforms raw text into `DOCUMENT` annotations, but `MultiDocumentAssembler` can take multiple input columns, which is useful in cases such as Table Question Answering.

## **🎬 Colab Setup**

In [1]:
!pip install -q pyspark==3.3.0  spark-nlp==4.3.1


In [2]:
import sparknlp


spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**

- Input: `None`

- Output: `ARRAY[DOCUMENT]`

## **🔎 Parameters**


- `cleanupMode` (String): Cleanup mode applied to the text (default: `"disabled"`).

- `idCol` (String): Name of the string type column that holds the row id.

- `metadataCol` (String): Name of the `Map` type column that holds metadata.



### `setCleanupMode()`


This parameter can be used to pre-process the text. It sets how to clean up a document that has noisy content such as blank lines and tabs.

Possible values for `cleanupMode`:

- **disabled**: Source kept as original. This is the default.
- **inplace**: Removes new lines and tabs.
- **inplace_full**: Removes new lines and tabs, including those that were converted to strings (i.e. `"\n"`).
- **shrink**: Removes new lines and tabs, merges multiple spaces and blank lines into a single space, and trims the ends of the text (`strip`).
- **shrink_full**: Removes new lines and tabs, including stringified values, and shrinks spaces and blank lines.


We will add blank lines and tabs to our sample text in order to see how pre-processing features work. 

In [3]:
sample_texts= """I love working with  \n   SparkNLP. \n

It is a perfect \tlibrary. 
"""

data = spark.createDataFrame([[sample_texts]]).toDF("text")

**`disabled`**

Building `MultiDocumentAssembler()` and transforming the example data with it.

In [4]:
from sparknlp.base import MultiDocumentAssembler


documentAssembler = MultiDocumentAssembler()\
    .setInputCols("text")\
    .setOutputCols("document")\
    .setCleanupMode("disabled")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------+
|document                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------+
|[{document, 0, 64, I love working with  \n   SparkNLP. \n\n\nIt is a perfect \tlibrary. \n, {sentence -> 0}, []}]|
+-----------------------------------------------------------------------------------------------------------------+



In [5]:
result.select("document.result").show(truncate=False)

+-------------------------------------------------------------------------+
|result                                                                   |
+-------------------------------------------------------------------------+
|[I love working with  \n   SparkNLP. \n\n\nIt is a perfect \tlibrary. \n]|
+-------------------------------------------------------------------------+



In [6]:
print(result.select("document.result").take(1)[0].result[0])

I love working with  
   SparkNLP. 


It is a perfect 	library. 



As seen above, there is no text pre-processing/cleaning applied. 
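Each entry of the `document` column printed above is an annotation struct whose fields are, in order, `annotatorType`, `begin`, `end`, `result`, `metadata`, and `embeddings`. The sketch below rebuilds that struct in plain Python for the sample text, to show why `end` is 64: it is the inclusive index of the last character. The helper `as_document_annotation` is hypothetical and only mirrors the printed output; it is not part of Spark NLP.

```python
def as_document_annotation(text):
    """Hypothetical helper: mimic the struct printed by result.select("document")."""
    return {
        "annotatorType": "document",
        "begin": 0,
        "end": len(text) - 1,  # inclusive index of the last character
        "result": text,
        "metadata": {"sentence": "0"},
        "embeddings": [],
    }


# Same content as `sample_texts`, written with explicit escapes
sample_text = "I love working with  \n   SparkNLP. \n\n\nIt is a perfect \tlibrary. \n"

annotation = as_document_annotation(sample_text)
print(annotation["begin"], annotation["end"])
```

With `disabled`, the text is untouched, so `end` depends only on the length of the raw input.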

**`inplace`**

In [7]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols("text")\
    .setOutputCols("document")\
    .setCleanupMode("inplace")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------+
|document                                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
|[{document, 0, 64, I love working with      SparkNLP.    It is a perfect  library.  , {sentence -> 0}, []}]|
+-----------------------------------------------------------------------------------------------------------+



In [8]:
result.select("document.result").show(truncate=False)

+-------------------------------------------------------------------+
|result                                                             |
+-------------------------------------------------------------------+
|[I love working with      SparkNLP.    It is a perfect  library.  ]|
+-------------------------------------------------------------------+



In [9]:
print(result.select("document.result").take(1)[0].result[0])

I love working with      SparkNLP.    It is a perfect  library.  


**`shrink`**

In [10]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols("text")\
    .setOutputCols("document")\
    .setCleanupMode("shrink")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)

+------------------------------------------------------------------------------------------------+
|document                                                                                        |
+------------------------------------------------------------------------------------------------+
|[{document, 0, 53, I love working with SparkNLP. It is a perfect library., {sentence -> 0}, []}]|
+------------------------------------------------------------------------------------------------+



In [11]:
result.select("document.result").show(truncate=False)

+--------------------------------------------------------+
|result                                                  |
+--------------------------------------------------------+
|[I love working with SparkNLP. It is a perfect library.]|
+--------------------------------------------------------+



In [12]:
print(result.select("document.result").take(1)[0].result[0])

I love working with SparkNLP. It is a perfect library.
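The three outputs above are consistent with simple regular-expression substitutions. The sketch below is a plain-Python approximation of `disabled`, `inplace`, and `shrink` that reproduces those outputs for the sample text. The function `approximate_cleanup` is hypothetical, it is not the library's implementation, and the `_full` modes are deliberately left out because they are not demonstrated in this notebook.

```python
import re


def approximate_cleanup(text, mode="disabled"):
    """Plain-Python approximation of three cleanup modes (not Spark NLP code)."""
    if mode == "disabled":
        return text
    if mode == "inplace":
        # Replace every whitespace character with one space; the length is unchanged
        return re.sub(r"\s", " ", text)
    if mode == "shrink":
        # Trim both ends, then collapse each run of whitespace into one space
        return re.sub(r"\s+", " ", text.strip())
    raise ValueError(f"mode not covered by this sketch: {mode}")


# Same content as `sample_texts`, written with explicit escapes
sample_text = "I love working with  \n   SparkNLP. \n\n\nIt is a perfect \tlibrary. \n"

for mode in ("disabled", "inplace", "shrink"):
    cleaned = approximate_cleanup(sample_text, mode)
    # len(cleaned) - 1 corresponds to the `end` index shown in the document column
    print(mode, len(cleaned) - 1, repr(cleaned))
```

This also explains the `end` values in the outputs: `inplace` keeps the index at 64 because it swaps characters one for one, while `shrink` shortens the text and the index drops to 53.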


### `setIdCol()`

This parameter sets the name of the string type column that holds the row id. The id is then added to the metadata of the output document column.

Creating sample data with id column:

In [13]:
# define a schema for the dataset
schema = "id INT, text STRING"

data= [{"id": 0, "text": "His name is Jack and lives in New York"},
       {"id": 1, "text": "She lives in LA"}]

# create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data, schema)
df.show(truncate=False)

+---+--------------------------------------+
|id |text                                  |
+---+--------------------------------------+
|0  |His name is Jack and lives in New York|
|1  |She lives in LA                       |
+---+--------------------------------------+



First, we will define `MultiDocumentAssembler()` without calling `setIdCol()` and check the metadata of the "document" column.

In [14]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols("text")\
    .setOutputCols("document")\
    .setCleanupMode("shrink")
    
result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+--------------------------------------------------------------------------------+
|document                                                                        |
+--------------------------------------------------------------------------------+
|[{document, 0, 37, His name is Jack and lives in New York, {sentence -> 0}, []}]|
|[{document, 0, 14, She lives in LA, {sentence -> 0}, []}]                       |
+--------------------------------------------------------------------------------+



In [15]:
result.select("document.metadata").show(truncate=False)

+-----------------+
|metadata         |
+-----------------+
|[{sentence -> 0}]|
|[{sentence -> 0}]|
+-----------------+



As seen above, the metadata only contains the sentence number.

Now, let's call `setIdCol("id")` and see the difference.

In [16]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols("text")\
    .setOutputCols("document")\
    .setCleanupMode("shrink")\
    .setIdCol("id")

result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+-----------------------------------------------------------------------------------------+
|document                                                                                 |
+-----------------------------------------------------------------------------------------+
|[{document, 0, 37, His name is Jack and lives in New York, {id -> 0, sentence -> 0}, []}]|
|[{document, 0, 14, She lives in LA, {id -> 1, sentence -> 0}, []}]                       |
+-----------------------------------------------------------------------------------------+



In [17]:
result.select("document.metadata").show(truncate=False)

+--------------------------+
|metadata                  |
+--------------------------+
|[{id -> 0, sentence -> 0}]|
|[{id -> 1, sentence -> 0}]|
+--------------------------+



As seen above, the metadata now includes the id because we called `setIdCol("id")`.

### `setMetadataCol()`

This parameter sets the name of the column that contains metadata information. The information should be a `dict` (`Map` type in Spark).

While `setIdCol()` adds only the id to the metadata, `setMetadataCol()` lets us add any other information to it.

Creating sample data with a MapType column containing metadata information: 

In [18]:
# define a schema for the dataset
schema = "id INT, name STRING, properties MAP<STRING, INT>"

# create a list of dictionaries to represent the data
data = [{"id": 1, "name": "Alice", "properties": {"age": 25, "height": 170}},
        {"id": 2, "name": "Bob", "properties": {"age": 30, "height": 180}},
        {"id": 3, "name": "Charlie", "properties": {"age": 35, "height": 175}}]

# create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data, schema)

# show the resulting DataFrame
df.show(truncate=False)

+---+-------+--------------------------+
|id |name   |properties                |
+---+-------+--------------------------+
|1  |Alice  |{age -> 25, height -> 170}|
|2  |Bob    |{age -> 30, height -> 180}|
|3  |Charlie|{age -> 35, height -> 175}|
+---+-------+--------------------------+



Now, we will use `setMetadataCol("properties")` to specify metadata information for the "name" column. 

In [19]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols("name")\
    .setOutputCols("document")\
    .setCleanupMode("shrink")\
    .setMetadataCol("properties")

result = documentAssembler.transform(df)

result.select("document").show(truncate=False)

+--------------------------------------------------------------------------+
|document                                                                  |
+--------------------------------------------------------------------------+
|[{document, 0, 4, Alice, {age -> 25, height -> 170, sentence -> 0}, []}]  |
|[{document, 0, 2, Bob, {age -> 30, height -> 180, sentence -> 0}, []}]    |
|[{document, 0, 6, Charlie, {age -> 35, height -> 175, sentence -> 0}, []}]|
+--------------------------------------------------------------------------+



In [20]:
result.select("document.result").show(truncate=False)

+---------+
|result   |
+---------+
|[Alice]  |
|[Bob]    |
|[Charlie]|
+---------+



Checking the metadata of the "document" column. 

In [21]:
result.select("document.metadata").show(truncate=False)

+-------------------------------------------+
|metadata                                   |
+-------------------------------------------+
|[{age -> 25, height -> 170, sentence -> 0}]|
|[{age -> 30, height -> 180, sentence -> 0}]|
|[{age -> 35, height -> 175, sentence -> 0}]|
+-------------------------------------------+
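The metadata maps printed for `setIdCol()` and `setMetadataCol()` look like a merge of up to three sources: the optional map column, the optional id column, and the `sentence` index that is always present. A plain-Python sketch of that merge follows. The function `merged_metadata` is hypothetical, and storing every value as a string is an assumption based on the string values that appear in the `LightPipeline` output later in this notebook. Passing both arguments at once is also an assumption, since the notebook only shows each parameter on its own.

```python
def merged_metadata(row_metadata=None, row_id=None):
    """Hypothetical helper: rebuild the metadata map seen in the outputs above."""
    metadata = {}
    if row_metadata:
        # Values are stringified here; this is an assumption, not confirmed by the source
        metadata.update({key: str(value) for key, value in row_metadata.items()})
    if row_id is not None:
        metadata["id"] = str(row_id)
    metadata["sentence"] = "0"  # present in every output shown above
    return metadata


print(merged_metadata(row_id=0))
print(merged_metadata(row_metadata={"age": 25, "height": 170}))
```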



## Use Case: TAPAS for Table Question Answering

In this section, we will use the `MultiDocumentAssembler()` annotator for the **TAPAS Table Question Answering** task.

TAPAS needs `MultiDocumentAssembler()` to assemble the table and the questions as two `DOCUMENT` type annotations that come from two columns of the data frame.

Creating an example table and some questions:

In [22]:
# Table
json_data = """
{
  "header": ["name", "money", "age"],
  "rows": [
    ["Donald Trump", "$100,000,000", "75"],
    ["Elon Musk", "$20,000,000,000,000", "55"]
  ]
}
"""

# Questions
queries = [
    "Who earns less than 200,000,000?",
    "Who earns 100,000,000?", 
    "How much money has Donald Trump?",
    "How old are they?",
]

# Spark data frame with two columns
data = spark.createDataFrame([
        [json_data, " ".join(queries)]
    ]).toDF("table_json", "questions")
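Writing the table JSON by hand is error-prone for larger tables. The sketch below builds the same `header` and `rows` layout from Python lists with the standard `json` module and checks that every row has one cell per header. The helper `build_table_json` is hypothetical, and converting every cell to a string is an assumption based on the all-string cells in the example above.

```python
import json


def build_table_json(header, rows):
    """Hypothetical helper: serialize a table as {"header": [...], "rows": [[...], ...]}."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} has {len(row)} cells, expected {len(header)}")
    table = {
        "header": [str(name) for name in header],
        # Cells are stringified to match the example table (assumption)
        "rows": [[str(cell) for cell in row] for row in rows],
    }
    return json.dumps(table)


table_json = build_table_json(
    ["name", "money", "age"],
    [
        ["Donald Trump", "$100,000,000", 75],
        ["Elon Musk", "$20,000,000,000,000", 55],
    ],
)
print(table_json)
```

The resulting string can be placed in the `table_json` column in the same way as the hand-written `json_data`.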

Importing extra annotators

In [23]:
from sparknlp.base import Pipeline, TableAssembler
from sparknlp.annotator import SentenceDetector, TapasForQuestionAnswering

Creating a pipeline to perform the task

In [24]:
# Text to `DOCUMENT` annotation
document_assembler = MultiDocumentAssembler() \
    .setInputCols("table_json", "questions") \
    .setOutputCols("document_table", "document_questions")

# Split into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document_questions"]) \
    .setOutputCol("questions")

# Transform the JSON formatted table into a proper format
table_assembler = TableAssembler()\
    .setInputCols(["document_table"])\
    .setOutputCol("table")

# Last component is `TapasForQuestionAnswering`, which will carry out the inference process
tapas = TapasForQuestionAnswering\
    .pretrained("table_qa_tapas_base_finetuned_wtq", "en")\
    .setInputCols(["questions", "table"])\
    .setOutputCol("answers")


pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    table_assembler,
    tapas
])

table_qa_tapas_base_finetuned_wtq download started this may take some time.
Approximate size to download 394.7 MB
[OK!]


This is the result on fit/transform:

In [25]:
# Data frame manipulations
import pyspark.sql.functions as F

In [26]:
model = pipeline.fit(data)
result = model.transform(data)

result.select(F.explode(F.arrays_zip(result.questions.result, result.answers.result)).alias("cols"))\
      .select(F.expr("cols['0'] as question"), F.expr("cols['1'] as answer"))\
      .show(truncate=False)

+--------------------------------+-----------------+
|question                        |answer           |
+--------------------------------+-----------------+
|Who earns less than 200,000,000?|Donald Trump     |
|Who earns 100,000,000?          |Donald Trump     |
|How much money has Donald Trump?|SUM($100,000,000)|
|How old are they?               |AVERAGE(75, 55)  |
+--------------------------------+-----------------+



## MultiDocumentAssembler with LightPipeline


In this section, we will cover the usage of `MultiDocumentAssembler` with `LightPipeline`.

We will demonstrate this with an example question answering use case that uses a pretrained `BertForQuestionAnswering` model.

Example data with a question and a context to fit our pipeline:

In [27]:
example = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
example.show(truncate=False)

+---------------+----------------------------------------+
|question       |context                                 |
+---------------+----------------------------------------+
|What's my name?|My name is Clara and I live in Berkeley.|
+---------------+----------------------------------------+



Creating pipeline

In [28]:
from sparknlp.base import LightPipeline
from sparknlp.annotator import BertForQuestionAnswering

In [29]:
documentAssembler = MultiDocumentAssembler()\
    .setInputCols(["question", "context"])\
    .setOutputCols(['document_question', 'document_context'])

spanClassifier = BertForQuestionAnswering.pretrained("bert_qa_bert_large_cased_whole_word_masking_finetuned_squad", "en")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setCaseSensitive(True)


pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

model = pipeline.fit(example)

bert_qa_bert_large_cased_whole_word_masking_finetuned_squad download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


Creating a `LightPipeline` and calling `fullAnnotate()` with a sample question and context.

In [30]:
lmodel= LightPipeline(model)
lresult= lmodel.fullAnnotate("Where does he work?", "He is data scientist and works at John Snow Labs")

Checking the results

In [31]:
lresult

[{'document_question': [Annotation(document, 0, 18, Where does he work?, {}, [])],
  'document_context': [Annotation(document, 0, 47, He is data scientist and works at John Snow Labs, {}, [])],
  'answer': [Annotation(chunk, 0, 13, John Snow Labs, {'chunk': '0', 'start_score': '0.99399257', 'score': '0.9931092', 'end': '16', 'start': '14', 'end_score': '0.99222594', 'sentence': '0'}, [])]}]

In [32]:
import pandas as pd

answers= []
questions= []

for i, j in list(zip(lresult[0]["document_question"], lresult[0]["answer"])):
  questions.append(i.result)
  answers.append(j.result)
  

df= pd.DataFrame({"question": questions, "answer": answers})
df.head()

|   | question | answer |
|---|----------|--------|
| 0 | Where does he work? | John Snow Labs |


As seen above, `LightPipeline` accepts the question and the context as two separate inputs and returns the answer span extracted from the context.