# Introducing Reader2Doc in Spark NLP
This notebook showcases the newly added `Reader2Doc` annotator in Spark NLP,
which provides a streamlined, user-friendly interface for reading files. It is useful for preprocessing data for NLP pipelines.

In [6]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1


## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for **Reader2Doc** annotator was introduced in Spark NLP 6.1.0. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab. This part is pretty easy with our simple script:

In [7]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

The output of Reader2Doc uses the same Annotation schema as other Spark NLP annotators. This means you can seamlessly integrate it into any Spark NLP pipeline or process that expects annotated data.
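
To make the schema concrete, here is a minimal sketch of one annotation record and of how the `begin`/`end` offsets in the outputs of this notebook appear to be laid out. `AnnotationSketch` and `to_annotations` are hypothetical names, not part of Spark NLP, and the contiguous, inclusive offsets are an assumption read off the printed results rather than a documented guarantee.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSketch:
    """Stand-in for one annotation record; not the Spark NLP class itself."""
    annotator_type: str
    begin: int
    end: int  # inclusive offset of the last character (assumption)
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def to_annotations(elements: List[str]) -> List[AnnotationSketch]:
    """Assign contiguous, inclusive offsets to a file's extracted elements."""
    annotations = []
    begin = 0
    for text in elements:
        end = begin + len(text) - 1
        annotations.append(AnnotationSketch("document", begin, end, text))
        begin = end + 1  # the next element starts right after this one
    return annotations


# Hypothetical element texts, chosen only to illustrate the offsets.
sample = to_annotations(["A sample title", "A short paragraph."])
for ann in sample:
    print(ann.annotator_type, ann.begin, ann.end, ann.result)
```

Because downstream annotators only rely on these fields, anything that consumes a `document` column can in principle consume the output of `Reader2Doc`.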

In [8]:
from sparknlp.reader.reader2doc import Reader2Doc
from pyspark.ml import Pipeline

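# Reader2Doc reads its input from the path set via setContentPath,
# so an empty DataFrame is enough to fit and run the pipeline.
empty_df = spark.createDataFrame([], "string").toDF("text")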

For the local-file examples, we will download different files from the Spark NLP GitHub repo.

## Reading PDF Documents

**Downloading PDF files**

In [9]:
!mkdir pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/pdf/image_3_pages.pdf -P pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/pdf/pdf-title.pdf -P pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/pdf/text_3_pages.pdf -P pdf-files

--2025-07-20 23:50:48--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/pdf/image_3_pages.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15629 (15K) [application/octet-stream]
Saving to: ‘pdf-files/image_3_pages.pdf’


2025-07-20 23:50:49 (13.4 MB/s) - ‘pdf-files/image_3_pages.pdf’ saved [15629/15629]


In [10]:
reader2doc = Reader2Doc() \
    .setContentType("application/pdf") \
    .setContentPath("./pdf-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 14...|
|[{document, 15, 3...|
|[{document, 36, 5...|
|[{document, 0, 14...|
|[{document, 15, 3...|
|[{document, 39, 6...|
+--------------------+
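
`show()` truncates each cell, so the extracted text is not visible in the table above. A minimal sketch for pulling the offsets and text out of collected rows follows. `annotation_texts` is a hypothetical helper, and it assumes each annotation exposes `begin`, `end` and `result` by name. Plain dictionaries stand in for Spark rows here so the example runs on its own.

```python
from typing import Any, Iterable, List, Tuple


def annotation_texts(rows: Iterable[Any],
                     col: str = "document") -> List[Tuple[int, int, str]]:
    """Return (begin, end, text) for every annotation in the given rows."""
    extracted = []
    for row in rows:
        for ann in row[col]:
            extracted.append((ann["begin"], ann["end"], ann["result"]))
    return extracted


# Hypothetical rows shaped like the `document` column.
fake_rows = [
    {"document": [{"begin": 0, "end": 4, "result": "Title"}]},
    {"document": [{"begin": 5, "end": 13, "result": "Body text"}]},
]
for begin, end, text in annotation_texts(fake_rows):
    print(begin, end, text)
```

With the DataFrame from the cell above, the same helper could be called as `annotation_texts(result_df.limit(5).collect())`. For a quick look without any helper, `result_df.show(truncate=False)` prints the full cell contents.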



## Reading HTML Documents

**Downloading HTML files**

In [11]:
!mkdir html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/html/example-10k.html -P html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/html/fake-html.html -P html-files

--2025-07-20 23:51:04--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/html/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘html-files/example-10k.html’


2025-07-20 23:51:04 (43.6 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]


In [12]:
reader2doc = Reader2Doc() \
    .setContentType("text/html") \
    .setContentPath("./html-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 12...|
|[{document, 13, 4...|
|[{document, 47, 6...|
|[{document, 69, 7...|
|[{document, 78, 1...|
|[{document, 164, ...|
|[{document, 207, ...|
|[{document, 297, ...|
|[{document, 330, ...|
|[{document, 363, ...|
|[{document, 382, ...|
|[{document, 447, ...|
|[{document, 702, ...|
|[{document, 755, ...|
|[{document, 862, ...|
|[{document, 992, ...|
|[{document, 1127,...|
|[{document, 1481,...|
|[{document, 1796,...|
|[{document, 2143,...|
+--------------------+
only showing top 20 rows



## Reading MS Office Documents

### Reading Word Files

**Downloading Word files**

In [13]:
!mkdir word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/doc/contains-pictures.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/doc/fake_table.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/doc/page-breaks.docx -P word-files

--2025-07-20 23:51:07--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/doc/contains-pictures.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 95087 (93K) [application/octet-stream]
Saving to: ‘word-files/contains-pictures.docx’


2025-07-20 23:51:07 (4.77 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]


In [14]:
reader2doc = Reader2Doc() \
    .setContentType("application/msword") \
    .setContentPath("./word-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 31...|
|[{document, 32, 4...|
|[{document, 430, ...|
|[{document, 504, ...|
|[{document, 586, ...|
|[{document, 0, 11...|
|[{document, 114, ...|
|[{document, 263, ...|
|[{document, 294, ...|
|[{document, 325, ...|
|[{document, 354, ...|
|[{document, 411, ...|
|[{document, 0, 11...|
|[{document, 12, 2...|
|[{document, 24, 3...|
|[{document, 35, 4...|
|[{document, 49, 6...|
+--------------------+



### Reading PowerPoint Files

**Downloading PowerPoint files**

In [15]:
!mkdir ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/ppt/fake-power-point.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/ppt/fake-power-point-table.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/ppt/speaker-notes.pptx -P ppt-files

--2025-07-20 23:51:11--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1259-Implement-Reader2Doc-Annotator/src/test/resources/reader/ppt/fake-power-point.pptx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38412 (38K) [application/octet-stream]
Saving to: ‘ppt-files/fake-power-point.pptx’


2025-07-20 23:51:11 (4.88 MB/s) - ‘ppt-files/fake-power-point.pptx’ saved [38412/38412]


In [16]:
reader2doc = Reader2Doc() \
    .setContentType("application/vnd.ms-powerpoint") \
    .setContentPath("./ppt-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 20...|
|[{document, 21, 5...|
|[{document, 51, 8...|
|[{document, 89, 1...|
|[{document, 144, ...|
|[{document, 166, ...|
|[{document, 0, 20...|
|[{document, 21, 5...|
|[{document, 51, 8...|
|[{document, 89, 1...|
|[{document, 144, ...|
|[{document, 166, ...|
|[{document, 0, 19...|
|[{document, 20, 2...|
|[{document, 28, 3...|
|[{document, 36, 4...|
|[{document, 44, 4...|
|[{document, 47, 5...|
|[{document, 52, 5...|
|[{document, 56, 6...|
+--------------------+
only showing top 20 rows



### Reading Excel Files

**Downloading Excel files**

In [17]:
!mkdir excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/vodafone.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/2023-half-year-analyses-by-segment.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/page-break-example.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/xlsx-subtable-cases.xlsx -P excel-files

--2025-07-20 23:51:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/vodafone.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12541 (12K) [application/octet-stream]
Saving to: ‘excel-files/vodafone.xlsx’


2025-07-20 23:51:15 (18.2 MB/s) - ‘excel-files/vodafone.xlsx’ saved [12541/12541]


In [18]:
reader2doc = Reader2Doc() \
    .setContentType("application/vnd.ms-excel") \
    .setContentPath("./excel-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 21...|
|[{document, 22, 4...|
|[{document, 44, 6...|
|[{document, 63, 1...|
|[{document, 107, ...|
|[{document, 339, ...|
|[{document, 395, ...|
|[{document, 452, ...|
|[{document, 508, ...|
|[{document, 566, ...|
|[{document, 615, ...|
|[{document, 682, ...|
|[{document, 734, ...|
|[{document, 793, ...|
|[{document, 858, ...|
|[{document, 949, ...|
|[{document, 993, ...|
|[{document, 1225,...|
|[{document, 1282,...|
|[{document, 1339,...|
+--------------------+
only showing top 20 rows



## Reading Text Documents

**Downloading Text files**

In [19]:
!mkdir txt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt -P txt-files

--2025-07-20 23:51:19--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300 [text/plain]
Saving to: ‘txt-files/simple-text.txt’


2025-07-20 23:51:19 (11.3 MB/s) - ‘txt-files/simple-text.txt’ saved [300/300]



In [20]:
reader2doc = Reader2Doc() \
    .setContentType("text/plain") \
    .setContentPath("./txt-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 17...|
|[{document, 18, 1...|
|[{document, 145, ...|
|[{document, 161, ...|
+--------------------+



## Reading XML Documents

**Downloading XML files**

In [21]:
!mkdir xml-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xml/multi-level.xml -P xml-files

--2025-07-20 23:51:20--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xml/multi-level.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 538 [text/plain]
Saving to: ‘xml-files/multi-level.xml’


2025-07-20 23:51:20 (26.1 MB/s) - ‘xml-files/multi-level.xml’ saved [538/538]



In [22]:
reader2doc = Reader2Doc() \
    .setContentType("application/xml") \
    .setContentPath("./xml-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 12...|
|[{document, 13, 2...|
|[{document, 25, 2...|
|[{document, 29, 5...|
|[{document, 52, 6...|
|[{document, 67, 7...|
+--------------------+



## Reading Markdown Documents

In [23]:
!mkdir md-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1213-Adding-MarkdownReader/src/test/resources/reader/md/simple.md -P md-files

--2025-07-20 23:51:21--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1213-Adding-MarkdownReader/src/test/resources/reader/md/simple.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 184 [text/plain]
Saving to: ‘md-files/simple.md’


2025-07-20 23:51:21 (2.67 MB/s) - ‘md-files/simple.md’ saved [184/184]



In [24]:
reader2doc = Reader2Doc() \
    .setContentType("text/markdown") \
    .setContentPath("./md-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 11...|
|[{document, 12, 7...|
|[{document, 80, 8...|
|[{document, 88, 1...|
|[{document, 102, ...|
|[{document, 115, ...|
|[{document, 129, ...|
+--------------------+



## Reading Email Documents

**Downloading Email files**

In [25]:
!mkdir email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml -P email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/test-several-attachments.eml -P email-files

--2025-07-20 23:51:22--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml’


2025-07-20 23:51:22 (35.7 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]


In [26]:
reader2doc = Reader2Doc() \
    .setContentType("message/rfc822") \
    .setContentPath("./email-files") \
    .setOutputCol("document")

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 23...|
|[{document, 24, 1...|
|[{document, 162, ...|
|[{document, 1419,...|
|[{document, 1431,...|
|[{document, 1456,...|
|[{document, 0, 21...|
|[{document, 22, 7...|
|[{document, 74, 1...|
|[{document, 1045,...|
|[{document, 1057,...|
+--------------------+
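
The content types passed to `setContentType` in the sections above can be collected into one lookup table. The sketch below does that with a hypothetical helper, `content_type_for`. The mapping only records the pairings used in this notebook and is not an exhaustive list of what `Reader2Doc` accepts.

```python
import os

# File extension -> content type, exactly as paired in this notebook.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".docx": "application/msword",
    ".pptx": "application/vnd.ms-powerpoint",
    ".xlsx": "application/vnd.ms-excel",
    ".txt": "text/plain",
    ".xml": "application/xml",
    ".md": "text/markdown",
    ".eml": "message/rfc822",
}


def content_type_for(path: str) -> str:
    """Look up the content type this notebook used for a file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError(f"No content type recorded for extension {ext!r}")
    return CONTENT_TYPES[ext]


print(content_type_for("email-files/email-text-attachments.eml"))
```

The returned string can then be passed to `setContentType` when building a reader for a folder of files with that extension.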



## Parameters

By default, each extracted element gets its own row, as in the outputs above. Setting `explodeDocs` to `False` outputs one row per document instead:

In [27]:
reader2doc = Reader2Doc() \
    .setContentType("message/rfc822") \
    .setContentPath("./email-files") \
    .setOutputCol("document") \
    .setExplodeDocs(False)

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 23...|
|[{document, 0, 21...|
+--------------------+
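
The difference between the two layouts can be sketched in plain Python. This is a toy model, not the annotator's implementation. The file names and element texts are hypothetical, and it assumes that with `explodeDocs` disabled all elements of a file are gathered into a single row, which is what the row counts (11 versus 2) suggest.

```python
from typing import Dict, List

# Hypothetical files and elements; only the counts mirror the outputs above.
files: Dict[str, List[str]] = {
    "first.eml": [f"element {i}" for i in range(6)],
    "second.eml": [f"element {i}" for i in range(5)],
}


def exploded(files: Dict[str, List[str]]) -> List[List[str]]:
    """One row per extracted element."""
    return [[element] for elements in files.values() for element in elements]


def per_file(files: Dict[str, List[str]]) -> List[List[str]]:
    """One row per file, holding all of that file's elements."""
    return [list(elements) for elements in files.values()]


print(len(exploded(files)), "rows when exploded")
print(len(per_file(files)), "rows when grouped per file")
```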



We can output plain text with minimal metadata by setting `flattenOutput` to `True`:

In [28]:
reader2doc = Reader2Doc() \
    .setContentType("text/html") \
    .setContentPath("./html-files") \
    .setOutputCol("document") \
    .setFlattenOutput(True)

pipeline = Pipeline(stages=[reader2doc])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
result_df.show()

+--------------------+
|            document|
+--------------------+
|[{document, 0, 12...|
|[{document, 13, 4...|
|[{document, 47, 6...|
|[{document, 69, 7...|
|[{document, 78, 1...|
|[{document, 164, ...|
|[{document, 207, ...|
|[{document, 297, ...|
|[{document, 330, ...|
|[{document, 363, ...|
|[{document, 382, ...|
|[{document, 447, ...|
|[{document, 702, ...|
|[{document, 755, ...|
|[{document, 862, ...|
|[{document, 992, ...|
|[{document, 1127,...|
|[{document, 1481,...|
|[{document, 1796,...|
|[{document, 2143,...|
+--------------------+
only showing top 20 rows



## Pipeline Integration

`Reader2Doc` can feed other annotators in a pipeline. For example, with a simple `RegexTokenizer`:

In [29]:
from sparknlp.annotator import *
from sparknlp.base import *

empty_df = spark.createDataFrame([], "string").toDF("text")

regex_tok = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regex_token")

# `reader2doc` is the instance defined in the previous cell (HTML input, flattened output)
pipeline = Pipeline(stages=[reader2doc, regex_tok])
model = pipeline.fit(empty_df)

result_df = model.transform(empty_df)
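
To see how token offsets relate to document offsets, here is a small stand-alone sketch. It splits on whitespace with Python's `re` module. That is an assumption about how a default `RegexTokenizer` behaves, not a reproduction of it. `whitespace_tokens` and the sample strings are hypothetical.

```python
import re
from typing import List, Tuple


def whitespace_tokens(text: str, doc_begin: int = 0) -> List[Tuple[int, int, str]]:
    """Split on whitespace and report inclusive offsets shifted by doc_begin."""
    return [
        (doc_begin + match.start(), doc_begin + match.end() - 1, match.group())
        for match in re.finditer(r"\S+", text)
    ]


# A document starting at offset 0 and one starting at offset 13.
print(whitespace_tokens("Hello world"))
print(whitespace_tokens("Sample text", doc_begin=13))
```

In this sketch, the first token of each document starts at the document's own `begin` offset, and `end` is inclusive.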

In [30]:
result_df.show()

+--------------------+--------------------+
|            document|         regex_token|
+--------------------+--------------------+
|[{document, 0, 12...|[{token, 0, 5, UN...|
|[{document, 13, 4...|[{token, 13, 22, ...|
|[{document, 47, 6...|[{token, 47, 57, ...|
|[{document, 69, 7...|[{token, 69, 72, ...|
|[{document, 78, 1...|[{token, 78, 78, ...|
|[{document, 164, ...|[{token, 164, 166...|
|[{document, 207, ...|[{token, 207, 207...|
|[{document, 297, ...|[{token, 297, 299...|
|[{document, 330, ...|[{token, 330, 339...|
|[{document, 363, ...|[{token, 363, 368...|
|[{document, 382, ...|[{token, 382, 387...|
|[{document, 447, ...|[{token, 447, 452...|
|[{document, 702, ...|[{token, 702, 711...|
|[{document, 755, ...|[{token, 755, 759...|
|[{document, 862, ...|[{token, 862, 869...|
|[{document, 992, ...|[{token, 992, 999...|
|[{document, 1127,...|[{token, 1127, 11...|
|[{document, 1481,...|[{token, 1481, 14...|
|[{document, 1796,...|[{token, 1796, 18...|
|[{document, 2143,...|[{token, 2