# Introducing the Markdown Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().md()` method in Spark NLP, which parses Markdown content from local files, URLs, and in-memory strings into a Spark DataFrame.

**Key Features:**
- Ability to parse Markdown from local directories, URLs, and in-memory strings.
- One DataFrame row per document, with a `source` column and an `md` array of typed elements.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading Markdown files was introduced in Spark NLP 6.0.5. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab.
- The setup script below takes care of this.

In [6]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local files example, we will download a Markdown file from the Spark NLP GitHub repo:

In [7]:
!mkdir md-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1213-Adding-MarkdownReader/src/test/resources/reader/md/simple.md -P md-files

--2025-07-08 21:52:18--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1213-Adding-MarkdownReader/src/test/resources/reader/md/simple.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 184 [text/plain]
Saving to: ‘md-files/simple.md’


2025-07-08 21:52:18 (2.25 MB/s) - ‘md-files/simple.md’ saved [184/184]



## Parsing Markdown from Local Files
Use the `md()` method to parse Markdown content from local directories.

In [8]:
import sparknlp
md_df = sparknlp.read().md("./md-files/")

md_df.show()

+--------------------+--------------------+
|              source|                  md|
+--------------------+--------------------+
|file:/content/md-...|[{Title, Introduc...|
+--------------------+--------------------+



In [9]:
md_df.printSchema()

root
 |-- source: string (nullable = true)
 |-- md: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
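
The schema above nests every parsed element inside the `md` array, so one row holds a whole document. As a minimal sketch of working with that shape, the helper below flattens collected rows into one record per element. It assumes the rows were already brought to the driver as plain dictionaries, for example with `[r.asDict(recursive=True) for r in md_df.collect()]`. The name `flatten_elements`, the `position` field, and the sample data are illustrative and not part of Spark NLP; only the `Title` element type appears in the output above, so the `ListItem` label is a placeholder assumption.

```python
from typing import Any, Dict, Iterable, List


def flatten_elements(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn one-record-per-document rows into one record per parsed element."""
    flat: List[Dict[str, Any]] = []
    for row in rows:
        # `md` is nullable in the schema, so treat None as an empty list.
        for position, element in enumerate(row.get("md") or []):
            flat.append(
                {
                    "source": row.get("source"),
                    "position": position,  # order of the element within its document
                    "elementType": element.get("elementType"),
                    "content": element.get("content"),
                    "metadata": dict(element.get("metadata") or {}),
                }
            )
    return flat


# Hand-written sample shaped like the printed schema (not real reader output).
# "ListItem" is a hypothetical element type used only for illustration.
sample_rows = [
    {
        "source": "in-memory",
        "md": [
            {"elementType": "Title", "content": "Shopping List", "metadata": {}},
            {"elementType": "ListItem", "content": "Milk", "metadata": {}},
        ],
    }
]

for record in flatten_elements(sample_rows):
    print(record["position"], record["elementType"], record["content"])
```

Collecting to the driver is only reasonable for small results; for larger DataFrames the same reshaping is usually done on the cluster instead.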



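### Available Inputs
Besides a local path, `md()` accepts a `url` argument for remote Markdown and a `text` argument for an in-memory string. The `source` column records where each document came from.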

In [10]:
md_df = sparknlp.read().md(url="https://raw.githubusercontent.com/adamschwartz/github-markdown-kitchen-sink/master/README.md")
md_df.show()

+--------------------+--------------------+
|              source|                  md|
+--------------------+--------------------+
|https://raw.githu...|[{Title, GitHub M...|
+--------------------+--------------------+



In [11]:
content = """\
# Shopping List
 - Milk
 - Bread
 - Eggs
"""

md_df = sparknlp.read().md(text=content)
md_df.show()

+---------+--------------------+
|   source|                  md|
+---------+--------------------+
|in-memory|[{Title, Shopping...|
+---------+--------------------+
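
The `text` input is handy when the Markdown is produced at runtime rather than stored in a file. The sketch below builds the same shopping-list string programmatically. The helper `build_markdown_list` is a hypothetical name introduced here for illustration; it is plain Python and not part of Spark NLP.

```python
from typing import Iterable


def build_markdown_list(title: str, items: Iterable[str]) -> str:
    """Build a Markdown document with a level-1 heading and a bullet list."""
    lines = [f"# {title}"]
    lines.extend(f"- {item}" for item in items)
    # End with a newline so the string is a well-formed text document.
    return "\n".join(lines) + "\n"


content = build_markdown_list("Shopping List", ["Milk", "Bread", "Eggs"])
print(content)

# The string can then be handed to the reader as in the cell above:
# md_df = sparknlp.read().md(text=content)
```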

