![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/SparkNLP_ReaderAssembler_Demo.ipynb)

# Introducing ReaderAssembler in Spark NLP

This notebook showcases `ReaderAssembler`, a newly added annotator in Spark NLP. `ReaderAssembler` provides a unified interface for combining multiple Spark NLP
readers (such as `Reader2Doc`, `Reader2Table`, and `Reader2Image`) into a single, configurable component.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp drive/MyDrive/JSL/sparknlp/sparknlp.jar .
!cp drive/MyDrive/JSL/sparknlp/spark_nlp-6.1.4-py2.py3-none-any.whl .

In [3]:
!pip install spark_nlp-6.1.4-py2.py3-none-any.whl

Processing ./spark_nlp-6.1.4-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-6.1.4


In [4]:
import sparknlp

# Start Spark with Spark NLP on CPU. If a GPU is available, call sparknlp.start(gpu=True) instead.
spark = sparknlp.start()
print(sparknlp.version())

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1


To illustrate the use of this reader, let's define an HTML document containing a paragraph, a heading, a table, and two images, and display a preview.

In [10]:
from IPython.display import display, HTML

html_code = """
<!DOCTYPE html>
<html>
<head>
    <title>Image Parsing Test</title>
</head>
<body>
<p style="font-size:12pt;">This is a normal paragraph.</p>
<h1>Test Images</h1>

<table>
  <tr>
    <td>Hello World</td>
  </tr>
</table>
<!-- Base64 inline PNG -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
  AAAFCAYAAACNbyblAAAAHElEQVQI12P4
  //8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
     alt="Base64 Red Dot" width="5" height="5">

<!-- External image -->
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/React-icon.svg/1024px-React-icon.svg.png"
     alt="React Logo" width="50" height="50">
</body>
</html>
"""

display(HTML(html_code))

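[Rendered HTML preview: a paragraph, the heading "Test Images", a one-cell table containing "Hello World", and the two images]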


The preview above contains two images: a small red dot, embedded inline as Base64, and the React logo, which resembles an atom. We expect a vision-language model (VLM) to generate descriptions of these images for us.
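Before handing the file to Spark NLP, it can help to confirm what the two `<img>` tags actually contain. The sketch below is not part of `ReaderAssembler`; it uses only the Python standard library, and the names `ImgCollector` and `png_size_from_data_uri` are hypothetical helpers introduced for illustration. It lists each image, labels it as inline or external, and reads the pixel size of the inline PNG from its header.

```python
import base64
import struct
from html.parser import HTMLParser

# The two <img> tags copied from the HTML document defined above.
SAMPLE_HTML = """
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
  AAAFCAYAAACNbyblAAAAHElEQVQI12P4
  //8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
     alt="Base64 Red Dot" width="5" height="5">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/React-icon.svg/1024px-React-icon.svg.png"
     alt="React Logo" width="50" height="50">
"""


class ImgCollector(HTMLParser):
    """Hypothetical helper: collects (alt, src) pairs for every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("alt"), attr_map.get("src")))


def png_size_from_data_uri(src):
    """Decode a Base64 PNG data URI and return (width, height) from its IHDR chunk."""
    payload = src.split(",", 1)[1]
    # The attribute value spans several lines, so drop all whitespace first.
    raw = base64.b64decode("".join(payload.split()))
    # Bytes 16-23 of a PNG file hold the big-endian width and height.
    return struct.unpack(">II", raw[16:24])


collector = ImgCollector()
collector.feed(SAMPLE_HTML)

# Label each image by how its pixels are supplied.
kinds = {}
for alt, src in collector.images:
    kinds[alt] = "inline" if src.startswith("data:") else "external"

red_dot_size = png_size_from_data_uri(collector.images[0][1])
print(kinds, red_dot_size)
```

The inline image is fully contained in the HTML file, while the external one is only a URL, which is worth keeping in mind when reading the image output later in the notebook.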

In [11]:
with open("example-images.html", "w") as f:
    f.write(html_code)

In [12]:
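# ReaderAssembler loads its data from the content path, so the pipeline input can be an empty DataFrame.
empty_df = spark.createDataFrame([], "string").toDF("text")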

In [16]:
from pyspark.ml import Pipeline
from sparknlp.reader.reader_assembler import ReaderAssembler

reader = ReaderAssembler() \
    .setContentType("text/html") \
    .setContentPath("./example-images.html") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader])
model = pipeline.fit(empty_df)

reader_df = model.transform(empty_df)

In [17]:
reader_df.printSchema()

root
 |-- fileName: string (nullable = true)
 |-- document_text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- document_table: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueCont

In [18]:
reader_df.select("document_text").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|document_text                                                                                                                                                                                            |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 26, This is a normal paragraph., {pageNumber -> 1, sentence -> 0, elementType -> Title}, []}, {document, 27, 37, Test Images, {pageNumber -> 1, sentence -> 1, elementType -> Title}, []}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
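The `begin` and `end` values printed above look like character offsets with an inclusive `end`, with each text element starting right after the previous one. That reading is inferred from this output alone, not taken from the Spark NLP documentation, so treat the sketch below as a sanity check on the printed numbers; `inclusive_offsets` is a hypothetical helper.

```python
def inclusive_offsets(chunks):
    """Yield (begin, end, text) with an inclusive end, laying chunks end to end."""
    position = 0
    for text in chunks:
        begin = position
        end = begin + len(text) - 1  # inclusive end, as in the printed output
        yield begin, end, text
        position = end + 1


# The two text elements found in the HTML document.
text_chunks = ["This is a normal paragraph.", "Test Images"]
offsets = list(inclusive_offsets(text_chunks))
for begin, end, text in offsets:
    print(begin, end, text)
```

Under this assumption the helper reproduces the pairs `0, 26` and `27, 37` shown in the `document_text` column.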

In [19]:
reader_df.select("document_table").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------+
|document_table                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 50, {"caption":"","header":[],"rows":[["Hello World"]]}, {pageNumber -> 1, sentence -> 2, elementType -> Table}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------+
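The `result` field of the table annotation above is a JSON string with `caption`, `header`, and `rows` keys. A minimal sketch of reading it in plain Python follows; the string is copied from the output above, and `table_rows` is a hypothetical helper, not a Spark NLP function.

```python
import json

# The `result` string copied from the document_table output above.
table_result = '{"caption":"","header":[],"rows":[["Hello World"]]}'


def table_rows(result_json):
    """Parse a table annotation's JSON result and return its rows as lists of cells."""
    table = json.loads(result_json)
    return table["rows"]


rows = table_rows(table_result)
for row in rows:
    print(" | ".join(row))
```

The same approach should work on values collected from the `document_table.result` column, assuming they keep this JSON shape.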



In [20]:
reader_df.select("document_image").show()

+--------------------+
|      document_image|
+--------------------+
|[{image, example-...|
+--------------------+



## Integration with Spark NLP VLM models

In [21]:
from sparknlp.annotator import Qwen2VLTransformer

visualQAClassifier = (
    Qwen2VLTransformer.pretrained()
    .setInputCols("document_image")
    .setOutputCol("answer")
)

pipeline = Pipeline().setStages([visualQAClassifier])
result_df = pipeline.fit(reader_df).transform(reader_df)

qwen2_vl_2b_instruct_int4 download started this may take some time.
Approximate size to download 1.4 GB
[OK!]


In [22]:
result_df.select("document_image.origin", "answer.result").show(truncate=False)

+------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|origin                                    |result                                                                                                                                                                                                                                                                                              |
+------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------