# Introducing the XML Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().xml()` method in Spark NLP, which parses XML content from both local files and real-time URLs into a Spark DataFrame.

**Key Features:**
- Ability to parse XML from local directories and URLs.
- Versatile support for varied data ingestion scenarios.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading XML files was introduced in Spark NLP 6.1.0. Please make sure you are running Spark NLP 6.1.0 or a later release.
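If you are unsure which release is installed, a small check like the one below can help. It is only a sketch: `meets_minimum` is a hypothetical helper written for this notebook, not part of Spark NLP, and it compares just the first three numeric components of the version string.

```python
import re

def meets_minimum(version: str, minimum: tuple = (6, 1, 0)) -> bool:
    """Return True if a dotted version string is at least `minimum`.

    Hypothetical helper for illustration; it is not part of Spark NLP.
    """
    # Keep only the first three numeric components, e.g. "6.1.0rc1" -> (6, 1, 0).
    parts = tuple(int(p) for p in re.findall(r"\d+", version)[:3])
    # Pad short versions such as "6.1" with zeros before comparing.
    parts = parts + (0,) * (3 - len(parts))
    return parts >= minimum
```

Once Spark NLP is installed, the helper can be called as `meets_minimum(sparknlp.version())`.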

- Let's install and set up Spark NLP in Google Colab
- This part is easy with our setup script

In [6]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local-file example, we will download two XML files from the Spark NLP GitHub repository:

In [7]:
!mkdir xml-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xml/multi-level.xml -P xml-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xml/test.xml -P xml-files

--2025-06-09 21:43:40--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1119-Implement-XML-Reader/src/test/resources/reader/xml/multi-level.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 538 [text/plain]
Saving to: ‘xml-files/multi-level.xml’


2025-06-09 21:43:40 (34.0 MB/s) - ‘xml-files/multi-level.xml’ saved [538/538]

--2025-06-09 21:43:40--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1119-Implement-XML-Reader/src/test/resources/reader/xml/test.xml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awai

## Parsing XML from Local Files
Use the `xml()` method to parse XML content from local directories.

In [8]:
import sparknlp
xml_df = sparknlp.read().xml("./xml-files")

xml_df.show()

+--------------------+--------------------+
|                path|                 xml|
+--------------------+--------------------+
|file:/content/xml...|[{Title, Harry Po...|
|file:/content/xml...|[{Title, The Alch...|
+--------------------+--------------------+



In [9]:
xml_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- xml: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
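To make the nested schema above more concrete, here is a minimal sketch that flattens rows of this shape into one tuple per XML element. It works on plain Python dictionaries, so the sample data and the helper name `flatten_rows` are assumptions made for illustration; only the field names `path`, `xml`, `elementType`, `content`, and `metadata` come from the schema.

```python
from typing import Dict, Iterator, List, Tuple

def flatten_rows(rows: List[dict]) -> Iterator[Tuple[str, str, str, Dict[str, str]]]:
    """Yield (path, elementType, content, metadata) for every element in every row."""
    for row in rows:
        # The schema marks `xml` as nullable, so treat None like an empty list.
        for element in row.get("xml") or []:
            yield (
                row["path"],
                element["elementType"],
                element["content"],
                element.get("metadata") or {},
            )

# Hypothetical rows that mimic the schema; they are not taken from the real files.
sample_rows = [
    {
        "path": "file:/tmp/example.xml",
        "xml": [
            {"elementType": "Title", "content": "Example title", "metadata": {}},
            {"elementType": "Title", "content": "Another title", "metadata": {}},
        ],
    },
    {"path": "file:/tmp/empty.xml", "xml": None},
]

flat = list(flatten_rows(sample_rows))
```

With a real DataFrame, rows in this dictionary form can be obtained with `[r.asDict(recursive=True) for r in xml_df.collect()]`, which is only advisable for small results because `collect()` brings all rows to the driver.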



### Configuration Parameters

`xmlKeepTags`: When true, includes the tag name of each XML element in the metadata under the key `tag`.
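Once the tag name is available under the `tag` metadata key, elements can be grouped by it. The sketch below does this on plain Python dictionaries that follow the schema shown earlier. The helper `contents_by_tag` and the sample elements (including the tag values `title` and `author`) are hypothetical and are not output from the real files.

```python
from collections import defaultdict
from typing import Dict, List

def contents_by_tag(elements: List[dict]) -> Dict[str, List[str]]:
    """Group element contents by the `tag` entry in their metadata."""
    grouped = defaultdict(list)
    for element in elements:
        # Elements without a `tag` entry are collected under a placeholder key.
        tag = (element.get("metadata") or {}).get("tag", "<no tag>")
        grouped[tag].append(element["content"])
    return dict(grouped)

# Hypothetical elements for illustration only.
sample_elements = [
    {"elementType": "Title", "content": "Example title", "metadata": {"tag": "title"}},
    {"elementType": "Title", "content": "Example author", "metadata": {"tag": "author"}},
    {"elementType": "Title", "content": "Second title", "metadata": {"tag": "title"}},
    {"elementType": "Title", "content": "Untagged text", "metadata": {}},
]

by_tag = contents_by_tag(sample_elements)
```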

In [10]:
params = {"xmlKeepTags": "true"}
xml_df = sparknlp.read(params).xml("./xml-files")
xml_df.select("xml").show(truncate=False)


`onlyLeafNodes`: When true, includes only leaf elements (i.e., elements with no child elements) in the output. When false, all elements (including containers) are included.
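The difference between leaf and container elements can be illustrated with Python's standard `xml.etree.ElementTree` module. This sketch only demonstrates the concept on a made-up document; it is not how Spark NLP implements the option, and `list_tags` and `SAMPLE_XML` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Made-up document: <library> and <book> are containers, <title> and <author> are leaves.
SAMPLE_XML = """
<library>
  <book>
    <title>Example Title</title>
    <author>Example Author</author>
  </book>
</library>
""".strip()

def list_tags(xml_text: str, only_leaf_nodes: bool = True) -> list:
    """Return element tags in document order, optionally keeping only leaf elements."""
    root = ET.fromstring(xml_text)
    tags = []
    for node in root.iter():
        # len(node) is the number of direct child elements; zero means a leaf.
        is_leaf = len(node) == 0
        if is_leaf or not only_leaf_nodes:
            tags.append(node.tag)
    return tags

leaf_tags = list_tags(SAMPLE_XML, only_leaf_nodes=True)
all_tags = list_tags(SAMPLE_XML, only_leaf_nodes=False)
```

Here `leaf_tags` contains only `title` and `author`, while `all_tags` also includes the `library` and `book` containers.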

In [11]:
params = {"onlyLeafNodes": "false"}
xml_df = sparknlp.read(params).xml("./xml-files")
xml_df.select("xml").show(truncate=False)


`storeContent`: When true, the raw content of each file is also included in the output DataFrame.

In [12]:
params = {"storeContent": "true"}
xml_df = sparknlp.read(params).xml("./xml-files")
xml_df.show(truncate=False)

+---------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------