# Introducing Reader2Table in Spark NLP
This notebook showcases the `Reader2Table` annotator, newly added to Spark NLP. It provides a streamlined interface for reading files and returns the tables they contain as structured JSON or HTML, which is useful when preprocessing data for NLP pipelines that rely on information inside tables.

In [6]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()


print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1


## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for **Reader2Table** was introduced in Spark NLP 6.1.1. Please make sure you are running Spark NLP 6.1.1 or later.

- Let's install and set up Spark NLP in Google Colab. This part is pretty easy with our setup script:

In [7]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

The output of `Reader2Table` uses the same Annotation schema as other Spark NLP annotators. This means you can seamlessly integrate it into any Spark NLP pipeline or process that expects annotated data.

In [8]:
from sparknlp.reader.reader2table import Reader2Table
from pyspark.ml import Pipeline

empty_df = spark.createDataFrame([], "string").toDF("text")

For the local-file examples, we will download a few sample files from the Spark NLP GitHub repository.

## Reading HTML Documents

**Downloading HTML files**

In [9]:
!mkdir html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-caption-th.html -P html-files


In [10]:
reader2table = Reader2Table() \
    .setContentType("text/html") \
    .setContentPath("./html-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName               |document                                                                                                                                                                                               |
+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|example-caption-th.html|[{document, 0, 116, {"caption":"Student Grades","header":["Name","Subject","Grade"],"rows":[["Alice","Math","A"],["Bob","Science","B+"]]}, {pageNumber -> 1, sentence -> 0, elementType -> Table}, []}]|
+-----------------------+-----------------------------------------------------------------------
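The `result` field of the annotation above is a JSON string with `caption`, `header` and `rows` keys. As a minimal sketch, assuming you have already pulled that string out of the DataFrame (here it is simply copied from the printed output), it can be parsed with Python's standard `json` module:

```python
import json

# The `result` field of the annotation printed above, copied verbatim.
html_result = (
    '{"caption":"Student Grades","header":["Name","Subject","Grade"],'
    '"rows":[["Alice","Math","A"],["Bob","Science","B+"]]}'
)

# Parse the JSON payload into a plain Python dictionary.
table = json.loads(html_result)

print(table["caption"])   # the table caption
print(table["header"])    # the list of column names
for row in table["rows"]:
    print(row)            # one list of cell values per table row
```

Each row is a list of strings, so numeric cells need to be converted explicitly if you want to compute with them.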

## Reading MS Office Documents

### Reading Word Files

**Downloading Word files**

In [11]:
!mkdir word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/contains-pictures.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/fake_table.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/page-breaks.docx -P word-files


In [12]:
reader2table = Reader2Table() \
    .setContentType("application/msword") \
    .setContentPath("./word-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName              |document                                                                                                                                         |
+----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|page-breaks.docx      |[]                                                                                                                                               |
|contains-pictures.docx|[]                                                                                                                                               |
|fake_table.docx       |[{document, 0, 96, {"caption":"","header":["Header Col 1","Header Col 2"],"rows":[["Lorem ipsum","A Link example"]]}, {el

### Reading PowerPoint Files

**Downloading PowerPoint files**

In [13]:
!mkdir ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/fake-power-point.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/fake-power-point-table.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/speaker-notes.pptx -P ppt-files


In [14]:
reader2table = Reader2Table() \
    .setContentType("application/vnd.ms-powerpoint") \
    .setContentPath("./ppt-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName                   |document                                                                                                                                                                              |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|speaker-notes.pptx         |[]                                                                                                                                                                                    |
|fake-power-point-table.pptx|[{document, 0, 114, {"caption":"","header":[],"rows":[["Red","Green","Blue"],["Purple","Orange","Yellow"],["Tangerine",

### Reading Excel Files

**Downloading Excel files**

In [15]:
!mkdir excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/vodafone.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/2023-half-year-analyses-by-segment.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/page-break-example.xlsx -P excel-files


In [16]:
reader2table = Reader2Table() \
    .setContentType("application/vnd.ms-excel") \
    .setContentPath("./excel-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Reading Markdown Documents

In [17]:
!mkdir md-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/md/simple-table.md -P md-files


In [18]:
reader2table = Reader2Table() \
    .setContentType("text/markdown") \
    .setContentPath("./md-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName       |document                                                                                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|simple-table.md|[{document, 0, 114, {"caption":"","header":["Item","Price","# In stock"],"rows":[["Juicy Apples","1.99","739"],["Bananas","1.89","6"]]}, {pageNumber -> 1, sentence -> 0, elementType -> Table}, []}]|
+---------------+-----------------------------------------------------------------------------------------------------------------------
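When the `header` list is populated, as in the Markdown output above, it is often convenient to turn each row into a dictionary keyed by column name. The sketch below does that with the standard library only; `table_to_records` is a hypothetical helper written for this notebook, not part of Spark NLP, and the input string is copied from the printed output.

```python
import json
from typing import Dict, List


def table_to_records(result: str) -> List[Dict[str, str]]:
    """Pair every row with the header to build one dict per table row."""
    table = json.loads(result)
    header = table["header"]
    return [dict(zip(header, row)) for row in table["rows"]]


# The `result` field of the Markdown annotation printed above.
md_result = (
    '{"caption":"","header":["Item","Price","# In stock"],'
    '"rows":[["Juicy Apples","1.99","739"],["Bananas","1.89","6"]]}'
)

records = table_to_records(md_result)
for record in records:
    print(record)
```

Note that `zip` stops at the shorter of the two sequences, so a row with more cells than the header would be silently trimmed.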

## Reading CSV Documents

In [19]:
!mkdir csv-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/csv/stanley-cups.csv -P csv-files


In [20]:
reader2table = Reader2Table() \
    .setContentType("text/csv") \
    .setContentPath("./csv-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName        |document                                                                                                                                                                                    |
+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|stanley-cups.csv|[{document, 0, 137, {"caption":"","header":[],"rows":[["Team","Location","Stanley Cups"],["Blues","STL","1"],["Flyers","PHI","2"],["Maple Leafs","TOR","13"]]}, {elementType -> Table}, []}]|
+----------------+------------------------------------------------------------------------------------------------------------------------------------------------------
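Notice that in the CSV output above `header` is empty and the column names (`Team`, `Location`, `Stanley Cups`) appear as the first entry of `rows`. If your downstream code expects the names in `header`, one option is to promote the first row yourself. This is only a sketch: `promote_first_row` is a hypothetical helper, and it assumes the first row really does hold column names.

```python
import json


def promote_first_row(table: dict) -> dict:
    """Return a copy whose header is the first row, if the header is empty."""
    if table["header"] or not table["rows"]:
        # Header already present, or nothing to promote: return an unchanged copy.
        return dict(table)
    first, *rest = table["rows"]
    return {**table, "header": first, "rows": rest}


# The `result` field of the CSV annotation printed above.
csv_result = (
    '{"caption":"","header":[],"rows":[["Team","Location","Stanley Cups"],'
    '["Blues","STL","1"],["Flyers","PHI","2"],["Maple Leafs","TOR","13"]]}'
)

fixed = promote_first_row(json.loads(csv_result))
print(fixed["header"])
print(fixed["rows"])
```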

## Parameters

We can explode the output into separate rows by calling `setExplodeDocs(True)`:

In [21]:
reader2table = Reader2Table() \
    .setContentType("application/vnd.ms-excel") \
    .setContentPath("./excel-files") \
    .setExplodeDocs(True) \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+---------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
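The effect of the flag is hard to see in a truncated printout. As a rough mental model only, and assuming `explodeDocs` behaves like Spark's `explode` function (one output row per element of the annotation array), the idea can be illustrated in plain Python. The file names and cell values below are made up for illustration and are not taken from the Excel files above.

```python
from typing import Dict, Iterator, List


def explode_rows(rows: List[Dict], column: str) -> Iterator[Dict]:
    """Yield one row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            # Copy the other columns and replace the list with a single element.
            yield {**row, column: item}


# Hypothetical input: one row per file, each holding a list of tables.
rows = [
    {"fileName": "a.xlsx", "document": ["table 1", "table 2"]},
    {"fileName": "b.xlsx", "document": ["table 3"]},
]

exploded = list(explode_rows(rows, "document"))
for row in exploded:
    print(row)
```

As with Spark's `explode`, a row whose list is empty produces no output rows in this sketch.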

We can also get an HTML representation of the table instead of JSON by setting `outputFormat` to `html-table`:

In [22]:
reader2table = Reader2Table() \
    .setContentType("text/csv") \
    .setContentPath("./csv-files") \
    .setOutputFormat("html-table") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2table])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show(truncate=False)

+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|fileName        |document                                                                                                                                                                                                                                                                                                                    |
+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
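To get a feel for how the two formats relate, the sketch below renders the JSON form of the CSV table as an HTML table using only the standard library. It is an approximation for illustration: `render_html_table` is a hypothetical helper, and the exact markup produced by `Reader2Table` with `html-table` may differ.

```python
import json
from html import escape


def render_html_table(table: dict) -> str:
    """Render a {"caption", "header", "rows"} dictionary as an HTML table."""
    parts = ["<table>"]
    if table["caption"]:
        parts.append(f"<caption>{escape(table['caption'])}</caption>")
    if table["header"]:
        cells = "".join(f"<th>{escape(cell)}</th>" for cell in table["header"])
        parts.append(f"<tr>{cells}</tr>")
    for row in table["rows"]:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)


# JSON form of the CSV table, copied from the earlier CSV output.
csv_json = (
    '{"caption":"","header":[],"rows":[["Team","Location","Stanley Cups"],'
    '["Blues","STL","1"],["Flyers","PHI","2"],["Maple Leafs","TOR","13"]]}'
)

html_table = render_html_table(json.loads(csv_json))
print(html_table)
```

Cell values are passed through `html.escape` so that characters such as `<` or `&` in the data cannot break the markup.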

## Pipeline Integration

`Reader2Table` can be combined with other annotators in a pipeline. For example, with a simple `RegexTokenizer`:

In [23]:
# Import only the annotator used below instead of wildcard imports
from sparknlp.annotator import RegexTokenizer

empty_df = spark.createDataFrame([], "string").toDF("text")

regex_tok = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regex_token")

pipeline = Pipeline(stages=[reader2table, regex_tok])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)

In [24]:
result_df.show()

+----------------+--------------------+--------------------+
|        fileName|            document|         regex_token|
+----------------+--------------------+--------------------+
|stanley-cups.csv|[{document, 0, 26...|[{token, 0, 21, <...|
+----------------+--------------------+--------------------+

