![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/01.1.Reader2_Family_Native_File_Readers.ipynb)

# 01.1 Reader2 Family: Native File Readers in Spark NLP

This notebook introduces the **Reader2 family of annotators**, a powerful set of components that bring native document ingestion directly into Spark NLP pipelines.

With these readers, you can extract text, tables, and images from a wide range of file formats without external preprocessing, making large-scale document workflows simpler, faster, and more reproducible.

## Overview

The Reader2 annotators allow you to load and structure multi-format content directly as Spark NLP annotations:

- **`Reader2Doc`**: extracts and structures textual content into `Document` annotations.  
- **`Reader2Image`**: extracts and structures images from standalone files or embedded media in documents.  
- **`Reader2Table`**: extracts and structures tabular data into machine-readable formats for downstream NLP or analytics tasks.

Together, these annotators enable fully integrated **Document AI pipelines**, capable of reading, parsing, and analyzing complex documents end-to-end inside Spark NLP without needing external file readers, Pandas, or OCR preprocessing.

## Supported File Formats

| Reader | Supported File Types |
|:-------|:----------------------|
| **Reader2Doc** | TXT, HTML, DOC, DOCX, XLS, XLSX, PPT, PPTX, EML, MSG, PDF |
| **Reader2Image** | PNG, JPG, BMP, GIF, PDF, DOCX, PPTX, XLSX, HTML, EML |
| **Reader2Table** | HTML, DOCX, XLSX, PPTX, CSV |
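
When a folder holds mixed file types, the table above can serve as a simple routing guide. The sketch below is plain Python and does not call Spark NLP: `READER_FORMATS` and `readers_for` are hypothetical names introduced for illustration, and the extension sets only transcribe the table (the per-reader sections list a few more extensions).

```python
from pathlib import Path

# Extension sets transcribed from the "Supported File Formats" table.
# READER_FORMATS and readers_for are illustrative names, not Spark NLP APIs.
READER_FORMATS = {
    "Reader2Doc": {"txt", "html", "doc", "docx", "xls", "xlsx",
                   "ppt", "pptx", "eml", "msg", "pdf"},
    "Reader2Image": {"png", "jpg", "bmp", "gif", "pdf", "docx",
                     "pptx", "xlsx", "html", "eml"},
    "Reader2Table": {"html", "docx", "xlsx", "pptx", "csv"},
}


def readers_for(file_name: str) -> list[str]:
    """Return the readers whose format list includes the file's extension."""
    extension = Path(file_name).suffix.lower().lstrip(".")
    return sorted(name for name, formats in READER_FORMATS.items()
                  if extension in formats)


print(readers_for("hardware_benchmarks.docx"))
print(readers_for("line_chart.jpg"))
```

A file that matches several readers, such as a `.docx`, could be passed to each of them to extract its text, tables, and images separately.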

### **Colab Setup**

In [None]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start(gpu=True)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 6.2.0
Apache Spark version: 3.5.1


### **Prepare data**

In [None]:
%%bash
set -e
git clone -q --no-checkout https://github.com/JohnSnowLabs/spark-nlp-workshop.git tmp
cd tmp
git sparse-checkout set reader2doc reader2table reader2image
git checkout -q
mkdir -p /content/files
mv reader2doc reader2table reader2image /content/files/
cd ..
rm -rf tmp


In [None]:
!ls files

reader2doc  reader2image  reader2table


## Reader2Doc




Instead of handling each file format separately (PDFs, Word files, emails, and so on), `Reader2Doc` hides the format-specific parsing and returns the text of every document as structured `Document` annotations. This makes it well suited to large-scale ingestion pipelines where documents come from mixed sources.

*Supported File Formats:*
- Text: `.txt`  
- HTML: `.html`, `.htm`  
- Microsoft Word: `.doc`, `.docx`  
- Microsoft Excel: `.xls`, `.xlsx`  
- Microsoft PowerPoint: `.ppt`, `.pptx`  
- Email files: `.eml`, `.msg`  
- PDF documents: `.pdf`


> This annotator is usually the **first stage** of a document-based pipeline, preparing structured text for tokenization, sentence segmentation, or downstream NLP tasks.

### Basic usage

Let's define our pipeline.

In [None]:
from sparknlp.reader.reader2doc import Reader2Doc

reader2doc = Reader2Doc().setContentPath("files/reader2doc")
pipeline = Pipeline(stages=[reader2doc])


Unlike traditional Spark NLP annotators, Reader2Doc reads files directly from the specified `contentPath` rather than from an input column.

Because of this, it does not need a `.setInputCols()` call. Instead, we fit and transform the pipeline on an **empty DataFrame**, since the reader itself handles file ingestion.

In [None]:
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)


In [None]:
result_df.show(truncate=False)


In [None]:
result_df.printSchema()

root
 |-- fileName: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- exception: void (nullable = true)
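
Each element of the `document` array carries the fields listed in the schema above. As a rough illustration of how collected annotations might be handled on the driver, the sketch below joins the `result` strings of one file. It uses plain dictionaries with made-up values as stand-ins for collected rows, and `join_results` is a hypothetical helper, not part of Spark NLP.

```python
from typing import Any, Sequence


def join_results(annotations: Sequence[Any], separator: str = "\n") -> str:
    """Concatenate the non-empty `result` strings of a list of annotations."""
    parts = [ann["result"] for ann in annotations if ann["result"]]
    return separator.join(parts)


# Stand-in annotations with illustrative values (not real Reader2Doc output).
sample_document = [
    {"annotatorType": "document", "begin": 0, "end": 4,
     "result": "Title", "metadata": {}, "embeddings": []},
    {"annotatorType": "document", "begin": 6, "end": 21,
     "result": "First paragraph.", "metadata": {}, "embeddings": []},
    {"annotatorType": "document", "begin": 22, "end": 22,
     "result": None, "metadata": {}, "embeddings": []},
]

print(join_results(sample_document))
```

Since PySpark `Row` objects also support lookup by field name, the same helper should work on `row.document` taken from `result_df.collect()`, although that is untested here.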



Let's summarize these extracted documents using `AutoGGUFModel`, which allows native integration of **Llama.cpp**-compatible models (such as **Phi-4**, **LLaMA**, or **Mistral**) directly within Spark NLP.

In [None]:
from sparknlp.annotator import AutoGGUFModel

auto_gguf_model = (
    AutoGGUFModel.pretrained("phi_4_mini_instruct_bf16_gguf", "en")
    .setInputCols(["document"])
    .setOutputCol("completions")
    .setSystemPrompt("You are a helpful assistant. Read the text below and write a clear, concise summary capturing the key ideas, facts, and tone.")
    .setCachePrompt(True)
    .setNPredict(200)
    .setTemperature(0.3)
    .setTopK(30)
    .setTopP(0.9)
    .setNCtx(4096)
    .setNThreads(8)
    .setNThreadsBatch(8)
    .setIgnoreEos(False)
    .setLogVerbosity(1)
)

pipeline = Pipeline().setStages([
    reader2doc,
    auto_gguf_model
])

model = pipeline.fit(empty_df)
result = model.transform(empty_df)


phi_4_mini_instruct_bf16_gguf download started this may take some time.
Approximate size to download 5.7 GB
[OK!]


In [None]:
result.select("fileName", "completions.result").show(truncate=False)

+--------+------+
|fileName|result|
+--------+------+

### Exploring Parameters

| Parameter | Description | Default |
|:-----------|:-------------|:----------|
| **contentPath** | Path to the input documents (local directory, file, or URL). | Required |
| **contentType** | MIME type of the document (e.g., `text/html`, `application/pdf`). Usually inferred automatically. | Optional |
| **explodeDocs** | Whether to output one document per row. Set to `False` to combine all content into a single record per file. | `False` |
| **flattenOutput** | If `True`, returns plain text with minimal metadata instead of full annotation structures. | `False` |
| **outputAsDocument** | Whether to output all content as a single combined `Document`. | `False` |
| **excludeNonText** | Exclude non-textual data like tables or images from the output. | `False` |
| **storeContent** | Include the raw file content in the DataFrame (useful for debugging or serialization). | `False` |
| **ignoreExceptions** | Continue processing even if some documents fail to parse. | `True` |
| **includeSlideNotes** | Include speaker notes when reading PowerPoint files. | `False` |
| **addAttachmentContent** | Extract plain-text attachments from emails (`.eml`, `.msg`). | `False` |


Let's use `SparkNLP_New_Notebooks_Proposals.xlsx` to explore these parameters.

In [None]:
!mkdir -p single-file && cp files/reader2doc/SparkNLP_New_Notebooks_Proposals.xlsx single-file/

**explodeDocs**

Whether to explode the documents into separate rows.

In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setExplodeDocs(True)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.show(truncate=False)


+--------+--------+---------+
|fileName|document|exception|
+--------+--------+---------+

In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setExplodeDocs(False)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.show(truncate=False)


**flattenOutput**

If `True`, the output is flattened to plain text with minimal metadata.

In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setFlattenOutput(True)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document.metadata").show(truncate=False)


+----------------------------+
|metadata                    |
+----------------------------+
|[{}, {}, {}, {}, {}, {}, {}]|
+----------------------------+



In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setFlattenOutput(False)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document.metadata").show(truncate=False)


+--------+
|metadata|
+--------+

**outputAsDocument**

Whether to return all sentences joined into a single document.

In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setOutputAsDocument(True)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document").show(truncate=False)



In [None]:
reader2doc = Reader2Doc() \
    .setContentPath("./single-file") \
    .setOutputAsDocument(False)

pipeline = Pipeline(stages=[reader2doc])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document").show(truncate=False)



## Reader2Table


The Reader2Table annotator enables seamless extraction of tabular content from documents within existing Spark NLP workflows. It allows you to efficiently parse tables from a wide variety of file types and return them as structured Spark DataFrames with metadata, ready for downstream processing or analysis.

*Supported File Formats:*
- HTML: `.html`, `.htm`  
- Word documents: `.doc`, `.docx`  
- Excel spreadsheets: `.xls`, `.xlsx`  
- PowerPoint presentations: `.ppt`, `.pptx`  
- CSV files: `.csv`  

### Basic usage

Let's define our pipeline.

In [None]:
from sparknlp.reader.reader2table import Reader2Table

reader2table = Reader2Table().setContentPath("files/reader2table")
pipeline = Pipeline(stages=[reader2table])


In [None]:
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)


In [None]:
result_df.show(truncate=False)

+--------+--------+---------+
|fileName|document|exception|
+--------+--------+---------+

In [None]:
result_df.printSchema()

root
 |-- fileName: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- exception: void (nullable = true)



In [None]:
result_df.select("document.result").collect()[0].result[0]

'{"caption":"","header":[],"rows":[["Department","2023 ($M)","2024 ($M)","% Growth"],["Research","25.0","35.0","40%"],["Infrastructure","15.0","18.0","20%"],["Product","10.0","16.0","60%"],["Governance","5.0","6.0","20%"]]}'

As you can see, the table is returned as a JSON string. We can use pandas to display it as a DataFrame.

In [None]:
import json
import pandas as pd

# Extract Result
hardware_row = result_df.filter(result_df.fileName == "hardware_benchmarks.docx").first().document[0]['result']

# Build Pandas DataFrame
data = json.loads(hardware_row)
table = pd.DataFrame(data["rows"], columns=data["header"])

table


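|   | GPU | Memory (GB) | TFLOPs | Power (W) | Price ($) |
|--:|:----|:------------|:-------|:----------|:----------|
| 0 | RTX 4090 | 24 | 83 | 450 | 1599 |
| 1 | A100 | 80 | 312 | 400 | 9999 |
| 2 | H100 | 80 | 730 | 700 | 29999 |
| 3 | MI300X | 192 | 1230 | 750 | 29999 |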
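
The first table printed earlier has an empty `header` list and carries its column names in the first entry of `rows`, so `columns=data["header"]` would not match the row width for that file. One possible way to handle both layouts is sketched below with the standard library only. `table_to_records` is a hypothetical helper, and treating the first row as the header when `header` is empty is an assumption that fits this sample, not documented Reader2Table behavior.

```python
import json


def table_to_records(table_json: str) -> list[dict[str, str]]:
    """Turn a json-table string into a list of {column: value} records."""
    data = json.loads(table_json)
    header, rows = data["header"], data["rows"]
    if not header and rows:
        # Assumption: with no explicit header, the first row holds column names.
        header, rows = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in rows]


# Payload copied from the Reader2Table output shown earlier.
budget_json = (
    '{"caption":"","header":[],"rows":['
    '["Department","2023 ($M)","2024 ($M)","% Growth"],'
    '["Research","25.0","35.0","40%"],'
    '["Infrastructure","15.0","18.0","20%"],'
    '["Product","10.0","16.0","60%"],'
    '["Governance","5.0","6.0","20%"]]}'
)

records = table_to_records(budget_json)
print(records[0])
```

The resulting list of dictionaries can be handed to `pd.DataFrame(records)` if a pandas view is preferred.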


Let's run a table question answering pipeline with TAPAS, using `hardware_benchmarks.docx` as the source table.

In [None]:
!mkdir -p single-table-file && cp files/reader2table/hardware_benchmarks.docx single-table-file/


In [None]:
reader = Reader2Table() \
    .setContentPath("./single-table-file") \
    .setOutputCol("document_table")

document_assembler = DocumentAssembler()\
    .setInputCol("questions")\
    .setOutputCol("document_questions")

sentence_detector = SentenceDetector()\
    .setInputCols(["document_questions"])\
    .setOutputCol("questions_detected")

table_assembler = TableAssembler()\
    .setInputCols(["document_table"])\
    .setOutputCol("table")

tapas = TapasForQuestionAnswering\
    .pretrained("table_qa_tapas_base_finetuned_wikisql_supervised", "en")\
    .setInputCols(["questions_detected", "table"])\
    .setOutputCol("answers")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    table_assembler,
    tapas
])

questions_df = spark.createDataFrame([
    ["Which GPU has the highest memory?"],
    ["What is the TFLOPs of the RTX 4090?"],
    ["How much power does the H100 consume?"],
    ["Which GPU costs $9999?"],
    ["What is the price of the MI300X?"],
    ["Which GPU has the highest TFLOPs?"],
    ["How many gigabytes of memory does the A100 have?"],
    ["What is the power consumption of the RTX 4090?"],
    ["Which GPU has 192 GB of memory?"],
    ["Compare the price of the H100 and MI300X."]
], ["questions"])

# attach table to every question row
table_df = reader.transform(empty_df).select("document_table")
combined_df = (questions_df.crossJoin(table_df))

model = pipeline.fit(combined_df)
result_df = model.transform(combined_df)

result_df.select("questions", "answers.result").show(truncate=False)


table_qa_tapas_base_finetuned_wikisql_supervised download started this may take some time.
Approximate size to download 394.7 MB
[OK!]
+------------------------------------------------+--------+
|questions                                       |result  |
+------------------------------------------------+--------+
|Which GPU has the highest memory?               |[MI300X]|
|What is the TFLOPs of the RTX 4090?             |[83]    |
|How much power does the H100 consume?           |[700]   |
|Which GPU costs $9999?                          |[A100]  |
|What is the price of the MI300X?                |[29999] |
|Which GPU has the highest TFLOPs?               |[MI300X]|
|How many gigabytes of memory does the A100 have?|[80]    |
|What is the power consumption of the RTX 4090?  |[450]   |
|Which GPU has 192 GB of memory?                 |[MI300X]|
|Compare the price of the H100 and MI300X.       |[29999] |
+------------------------------------------------+--------+
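
Because the source table is small, a few of the answers above can be cross-checked with plain Python. The sketch below copies the table values for `hardware_benchmarks.docx` as a json-table string and recomputes three lookups. `argmax_row` and `lookup` are hypothetical helpers written for this check, not Spark NLP functions.

```python
import json

# Table values copied from hardware_benchmarks.docx as displayed earlier.
table_json = (
    '{"caption":"","header":["GPU","Memory (GB)","TFLOPs","Power (W)","Price ($)"],'
    '"rows":[["RTX 4090","24","83","450","1599"],["A100","80","312","400","9999"],'
    '["H100","80","730","700","29999"],["MI300X","192","1230","750","29999"]]}'
)

data = json.loads(table_json)
records = [dict(zip(data["header"], row)) for row in data["rows"]]


def argmax_row(column: str) -> str:
    """Return the GPU whose numeric value in `column` is largest."""
    return max(records, key=lambda rec: float(rec[column]))["GPU"]


def lookup(gpu: str, column: str) -> str:
    """Return the cell value for one GPU and one column."""
    return next(rec[column] for rec in records if rec["GPU"] == gpu)


print(argmax_row("Memory (GB)"))     # compare with "Which GPU has the highest memory?"
print(lookup("RTX 4090", "TFLOPs"))  # compare with "What is the TFLOPs of the RTX 4090?"
print(lookup("H100", "Power (W)"))   # compare with "How much power does the H100 consume?"
```

The last question in the list asks for a comparison, yet the model returns a single cell value, so questions of that kind are probably better rephrased as direct lookups.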



### Exploring Parameters

| **Parameter**             | **Description**                                                                                                              | **Default**        |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------|-------------------|
| **contentPath**           | Path to the content source.                                                                                                   |                   |
| **inputCol**              | Input column name in the DataFrame.                                                                                           |                   |
| **outputCol**             | Output column name in the DataFrame.                                                                                          |                   |
| **outputFormat**          | Output format for the table content. Options: `json-table`, `html-table`.                            | `json-table`      |
| **appendCells**           | Whether to append all rows into a single content block instead of creating separate elements per row.                           | `false`           |
| **cellSeparator**         | String used to join cell values in a row when assembling textual output.                                                      | `" "`             |
| **explodeDocs**           | Whether to explode the documents into separate rows.                                                                          | `false`           |
| **flattenOutput**         | If `true`, output is flattened to plain text with minimal metadata.                                                           | `false`           |
| **groupBrokenParagraphs** | Whether to merge fragmented lines into coherent paragraphs using heuristics based on line length and structure.               | `true`            |
| **paragraphSplit**        | Regex pattern used to detect paragraph boundaries when grouping broken paragraphs.                                             | `"\n\n"`          |
| **shortLineWordThreshold**| Maximum word count for a line to be considered "short" during broken paragraph grouping.                                       | `10`              |
| **includePageBreaks**     | Whether to detect and tag content with page break metadata.                                                                  | `false`           |
| **includeSlideNotes**     | Whether to extract speaker notes from slides. When enabled, notes are included as narrative text elements.                    | `false`           |
| **addAttachmentContent**  | Whether to extract and include the textual content of plain-text attachments in the output.                                   | `false`           |
| **storeContent**          | Whether to include the raw file content in the output DataFrame as a separate `content` column.                               | `false`           |
| **ignoreExceptions**      | Whether to ignore exceptions during processing.                                                                              | `false`           |
| **inferTableStructure**   | Whether to generate an HTML `<table>` representation from structured table content.                                           | `false`           |
| **outputAsDocument**      | Whether to return all sentences joined into a single document.                                                               | `false`           |
| **timeout**               | Timeout value in seconds for reading remote HTML resources. Applied when fetching content from URLs.                           | `30`              |


**outputFormat**

Output format for the table content. Options are `html-table` or `json-table`.

In [None]:
reader2table = Reader2Table() \
    .setContentPath("./single-table-file") \
    .setOutputFormat("html-table")

pipeline = Pipeline(stages=[reader2table])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document.result").show(truncate=False)


+------+
|result|
+------+

In [None]:
reader2table = Reader2Table() \
    .setContentPath("./single-table-file") \
    .setOutputFormat("json-table")

pipeline = Pipeline(stages=[reader2table])
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)

result_df.select("document.result").show(truncate=False)


+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"caption":"","header":["GPU","Memory (GB)","TFLOPs","Power (W)","Price ($)"],"rows":[["RTX 4090","24","83","450","1599"],["A100","80","312","400","9999"],["H100","80","730","700","29999"],["MI300X","192","1230","750","29999"]]}]|
+-------------------------------------------------------------------
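
If the extracted table needs to leave Spark, the `json-table` payload printed above can be serialized with the standard library. The following is a minimal sketch, assuming the payload always has `header` and `rows` keys as in this output; `json_table_to_csv` is a hypothetical helper name.

```python
import csv
import io
import json


def json_table_to_csv(table_json: str) -> str:
    """Serialize a json-table payload (header plus rows) as CSV text."""
    data = json.loads(table_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if data["header"]:
        writer.writerow(data["header"])
    writer.writerows(data["rows"])
    return buffer.getvalue()


# Payload copied from the json-table output above.
payload = (
    '{"caption":"","header":["GPU","Memory (GB)","TFLOPs","Power (W)","Price ($)"],'
    '"rows":[["RTX 4090","24","83","450","1599"],["A100","80","312","400","9999"],'
    '["H100","80","730","700","29999"],["MI300X","192","1230","750","29999"]]}'
)

csv_text = json_table_to_csv(payload)
print(csv_text)
```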

## Reader2Image

The Reader2Image annotator brings image reading into existing Spark NLP workflows. It extracts images both from standalone image files and from documents with embedded images, and returns them as structured Spark DataFrames with associated metadata, ready for downstream processing in Spark NLP pipelines.

*Supported File Formats:*
- Image files: `.png`, `.jpg`, `.jpeg`, `.bmp`, `.gif`  
- Documents with embedded images: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`, `.eml`, `.msg`, `.html`, `.htm`, `.md`


### Basic usage

In [None]:
from sparknlp.reader.reader2image import Reader2Image

reader2image = Reader2Image()\
    .setContentPath("files/reader2image")\
    .setOutputCol("image")

pipeline = Pipeline(stages=[reader2image])


In [None]:
empty_df = spark.createDataFrame([], "string").toDF("text")

model = pipeline.fit(empty_df)
result_df = model.transform(empty_df)


In [None]:
result_df.show()

+--------------------+--------------------+---------+
|            fileName|               image|exception|
+--------------------+--------------------+---------+
|      line_chart.jpg|[{image, line_cha...|     NULL|
|johnsnowlabs_logo...|[{image, johnsnow...|     NULL|
|   Venn_diagram.jpeg|[{image, Venn_dia...|     NULL|
|windows_wallpaper...|[{image, windows_...|     NULL|
|              67.gif|[{image, 67.gif, ...|     NULL|
| embedded_image.docx|[{image, embedded...|     NULL|
+--------------------+--------------------+---------+



In [None]:
result_df.printSchema()

root
 |-- fileName: string (nullable = true)
 |-- image: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- origin: string (nullable = true)
 |    |    |-- height: integer (nullable = false)
 |    |    |-- width: integer (nullable = false)
 |    |    |-- nChannels: integer (nullable = false)
 |    |    |-- mode: integer (nullable = false)
 |    |    |-- result: binary (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- exception: string (nullable = true)



Let's use `Qwen2VLTransformer` to describe these images. We will use the existing `image` column obtained from `Reader2Image` and add another column, `text`, which will contain the prompt for the VLM.

In [None]:
from pyspark.sql.functions import lit

prompt_df = result_df.withColumn(
    "text",
    lit(
        # Each role tag and each <|im_end|> is followed by a newline,
        # matching the Qwen2-VL chat template.
        "<|im_start|>system\n"
        "You are a helpful assistant that describes images clearly and accurately."
        "<|im_end|>\n"
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"
        "Describe this image in detail."
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
)

prompt_df.show()


+--------------------+--------------------+---------+--------------------+
|            fileName|               image|exception|                text|
+--------------------+--------------------+---------+--------------------+
|      line_chart.jpg|[{image, line_cha...|     NULL|<|im_start|>syste...|
|johnsnowlabs_logo...|[{image, johnsnow...|     NULL|<|im_start|>syste...|
|   Venn_diagram.jpeg|[{image, Venn_dia...|     NULL|<|im_start|>syste...|
|windows_wallpaper...|[{image, windows_...|     NULL|<|im_start|>syste...|
|              67.gif|[{image, 67.gif, ...|     NULL|<|im_start|>syste...|
| embedded_image.docx|[{image, embedded...|     NULL|<|im_start|>syste...|
+--------------------+--------------------+---------+--------------------+
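
The prompt above follows a ChatML-style layout: a system turn, a user turn containing one image placeholder, and an open assistant turn. As a minimal sketch, the helper below assembles that string from its parts so the instruction can be changed without retyping the special tokens. The function name `build_qwen2vl_prompt` is hypothetical, and the newline after each role marker and after each `<|im_end|>` is an assumption based on the usual ChatML convention.

```python
def build_qwen2vl_prompt(
    user_message: str,
    system_message: str = "You are a helpful assistant.",
) -> str:
    """Assemble a ChatML-style prompt with a single image placeholder.

    Assumption: role markers and <|im_end|> tokens are followed by a newline.
    """
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        "<|im_start|>user\n"
        f"<|vision_start|><|image_pad|><|vision_end|>{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_qwen2vl_prompt(
    "Describe this image in detail.",
    "You are a helpful assistant that describes images clearly and accurately.",
)
print(prompt)

# In the notebook, the string could then be attached to every row:
#   prompt_df = result_df.withColumn("text", lit(prompt))
```

Keeping the template in one function makes it easy to try different instructions (for example, asking only for the chart title) while keeping the token layout identical across runs.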



In [None]:
from sparknlp.annotator import Qwen2VLTransformer

multiModel = (
    Qwen2VLTransformer.pretrained("qwen2_vl_2b_instruct_int4")
    .setInputCols("image")
    .setOutputCol("answer")
)

pipeline = Pipeline().setStages([multiModel])

model = pipeline.fit(prompt_df)
result = model.transform(prompt_df)


qwen2_vl_2b_instruct_int4 download started this may take some time.
Approximate size to download 1.4 GB
[OK!]


In [None]:
result.select("fileName", "answer.result").show(truncate=False)

+---------------------+--------------------+
|fileName             |result              |
+---------------------+--------------------+
|...                  |...                 |

### Exploring Parameters

| **Parameter**          | **Why it’s useful / Notes**                                                                                           | **Default** |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------- | ----------- |
| **contentPath**        | Core input: the source of the file(s). Can be a local path or a URL.                                                 |             |
| **outputCol**          | Where the extracted results go.                                                                                      |             |
| **readAsImage**        | Key for PDFs: whether to process pages as images. Necessary for scanned PDFs, where direct text extraction won't work. | `false`     |
| **splitPage**          | Splits the document per page. Improves performance and enables page-specific operations.                               | `true`      |
| **onlyPageNum**        | Extracts only page numbers if you don’t need full text content.                                                      | `false`     |
| **storeContent**       | Retains raw file bytes alongside structured output, useful if you want to save or process the original file later.    | `false`     |
| **flattenOutput**      | Outputs clean, concatenated text with minimal metadata. Good for quick analysis or NLP tasks.                         | `false`     |
| **normalizeLigatures** | Converts ligatures like `ﬂ` into standard characters (`fl`). Helps avoid text artifacts, especially in scanned PDFs. | `true`      |
| **timeout**            | Maximum seconds to wait when fetching remote resources (URLs). Helps prevent notebook hangs.                          | `30`        |

**Tips for Usage:**
- Use `readAsImage` for scanned PDFs and `splitPage` for large documents to improve extraction quality and performance.
- `flattenOutput` is optional, but it makes the output easier to handle in downstream analysis or NLP pipelines.
- `storeContent` is useful if you want to archive the original file alongside the extracted text for auditing or reuse.
- `normalizeLigatures` is usually recommended for PDFs so that ligature characters do not break your text processing.
- `timeout` matters when fetching from URLs, because it keeps a slow remote source from stalling the session.
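
One way to keep these choices explicit is to hold them in a plain dictionary and apply them in one place. The sketch below copies the defaults from the table above and lets you override only what differs. The helper names `reader_settings` and `apply_settings` are hypothetical. The `set<ParamName>` setter naming is an assumption based on the usual Spark ML convention, so any parameter without a matching setter on your reader is reported rather than applied.

```python
# Defaults copied from the parameter table above.
DEFAULTS = {
    "readAsImage": False,
    "splitPage": True,
    "onlyPageNum": False,
    "storeContent": False,
    "flattenOutput": False,
    "normalizeLigatures": True,
    "timeout": 30,
}


def reader_settings(**overrides):
    """Return the table defaults with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def apply_settings(reader, settings):
    """Call set<ParamName>(value) for each setting; return names with no setter.

    Assumption: setters follow the set<ParamName> naming pattern.
    """
    skipped = []
    for name, value in settings.items():
        setter = getattr(reader, "set" + name[0].upper() + name[1:], None)
        if callable(setter):
            setter(value)
        else:
            skipped.append(name)
    return skipped


# Settings for a scanned PDF whose original bytes should be kept.
scanned_pdf = reader_settings(readAsImage=True, storeContent=True)
print(scanned_pdf)
```

Checking the list returned by `apply_settings` shows which parameters the reader in use does not expose, which is safer than assuming that every reader supports every option in the table.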