![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_Word_Reader_Demo.ipynb)

# Introducing the Word Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().doc()` method in Spark NLP, which parses the content of Word documents from both local and distributed file systems into a Spark DataFrame.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading Word files was introduced in Spark NLP 5.5.2. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local files example, we will download a couple of Word files from the Spark NLP GitHub repo:

In [8]:
!mkdir word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/contains-pictures.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/fake_table.docx -P word-files

--2024-12-11 02:43:35--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/contains-pictures.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 95087 (93K) [application/octet-stream]
Saving to: ‘word-files/contains-pictures.docx’


2024-12-11 02:43:35 (2.47 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]

--2024-12-11 02:43:36--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/fake_table.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.co

In [9]:
!ls -lh ./word-files

total 112K
-rw-r--r-- 1 root root 93K Dec 11 02:43 contains-pictures.docx
-rw-r--r-- 1 root root 13K Dec 11 02:43 fake_table.docx


## Parsing Word Documents from Local Files
Use the `doc()` method to parse Word document content from local directories.

In [10]:
import sparknlp

doc_df = sparknlp.read().doc("./word-files")



In [12]:
doc_df.select("doc").show()

+--------------------+
|                 doc|
+--------------------+
|[{Table, Header C...|
|[{Header, An inli...|
+--------------------+



In [11]:
doc_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- doc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
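
Each entry in the `doc` array is a struct with `elementType`, `content`, and `metadata` fields, so once rows are collected to the driver they can be handled with plain Python. As a minimal sketch, the helper below tallies the element types found in one document. `count_element_types` is a hypothetical name, not part of Spark NLP, and the sample values are made up for illustration; only the field names and the `Header` and `Table` types come from the output above.

```python
from collections import Counter


def count_element_types(elements):
    """Count how many parsed elements of each type one document contains.

    `elements` is one row's `doc` array. Items are looked up by field name,
    which works for plain dicts and for collected PySpark Row objects.
    """
    return Counter(el["elementType"] for el in elements)


# Hypothetical stand-in for a single row's `doc` array (illustrative values only).
sample_doc = [
    {"elementType": "Header", "content": "Example header", "metadata": {}},
    {"elementType": "Table", "content": "Example cell A", "metadata": {}},
    {"elementType": "Table", "content": "Example cell B", "metadata": {}},
]

counts = count_element_types(sample_doc)
print(dict(counts))

# With a live Spark session, the same helper could be applied per file, e.g.:
#   for row in doc_df.select("path", "doc").collect():
#       print(row["path"], dict(count_element_types(row["doc"])))
```

Collecting is only reasonable for small inputs like the two files used here; for larger collections it is usually better to keep the work in Spark, for example by flattening the array with `pyspark.sql.functions.explode`.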



You can also read from distributed file systems (DFS) such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
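
Since the same `doc()` call is used for local and distributed paths, it can help to check which kind of location a path points to before reading it. The sketch below is a hypothetical pre-check, not part of Spark NLP: `describe_source` and `DFS_SCHEMES` are illustrative names, the example paths are placeholders, and the set of schemes is simply the list above.

```python
from urllib.parse import urlparse

# Schemes taken from the list above; anything without a scheme is treated as local.
DFS_SCHEMES = {"dbfs", "hdfs", "abfss"}


def describe_source(path):
    """Return "local" for scheme-less paths, or the DFS scheme for listed ones."""
    scheme = urlparse(path).scheme
    if not scheme:
        return "local"
    if scheme in DFS_SCHEMES:
        return scheme
    raise ValueError(f"Unrecognized scheme in path: {path!r}")


# Placeholder paths for illustration only.
for candidate in ["./word-files", "hdfs://namenode/data/word-files"]:
    print(candidate, "->", describe_source(candidate))

# The path itself would then be passed unchanged to the reader, e.g.:
#   doc_df = sparknlp.read().doc("hdfs://namenode/data/word-files")
```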