![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb)

# Introducing the HTML Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().html()` method in Spark NLP, which parses HTML content from both local files and real-time URLs into a Spark DataFrame.

**Key Features:**
- Parses HTML from local files and directories as well as from URLs.
- Accepts a single path, a single URL, or a list of URLs as input.

## Setup and Initialization
Keep a few things in mind before we start 😊

Support for reading HTML files was introduced in `Spark NLP 5.5.2`. Please make sure you are running that release or a later one.

- Let's install and set up Spark NLP in Google Colab.
- The setup script in the next cell takes care of this.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local-files example, we will download a couple of HTML files from the Spark NLP GitHub repository:

In [6]:
!mkdir -p html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-10k.html -P html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/fake-html.html -P html-files

--2024-11-05 20:02:19--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘html-files/example-10k.html’


2024-11-05 20:02:19 (157 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]

--2024-11-05 20:02:20--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/fake-html.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.co

## Parsing HTML from Local Files
Use the `html()` method to parse HTML content from local directories.

In [7]:
import sparknlp
html_df = sparknlp.read().html("./html-files")

html_df.show()

+--------------------+--------------------+--------------------+
|                path|             content|                html|
+--------------------+--------------------+--------------------+
|file:/content/htm...|<!DOCTYPE html>\n...|[{Title, 0, My Fi...|
|file:/content/htm...|<?xml  version="1...|[{Title, 0, UNITE...|
+--------------------+--------------------+--------------------+



You can also read from distributed file systems (DFS), such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
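When paths come from configuration or user input, it can help to check which kind of storage a path string targets before handing it to `html()`. The following is a minimal sketch using only the Python standard library; `storage_kind` and `DISTRIBUTED_SCHEMES` are hypothetical names introduced for illustration and are not part of Spark NLP, and the host names in the sample paths are placeholders.

```python
from urllib.parse import urlparse

# Schemes taken from the list above; extend this set for your own cluster.
DISTRIBUTED_SCHEMES = {"dbfs", "hdfs", "abfss"}


def storage_kind(path: str) -> str:
    """Classify a path string as 'distributed', 'local', or 'other' (hypothetical helper)."""
    scheme = urlparse(path).scheme.lower()
    if scheme in DISTRIBUTED_SCHEMES:
        return "distributed"
    if scheme in ("", "file"):
        # No scheme (for example "./html-files") or an explicit file: URI.
        return "local"
    return "other"


# Placeholder paths, used only to exercise the helper.
for p in [
    "./html-files",
    "hdfs://namenode:8020/data/html",
    "abfss://container@account.dfs.core.windows.net/html",
]:
    print(p, "->", storage_kind(p))
```

The check only inspects the string; it does not verify that the location exists or that the cluster is configured to reach it.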

## Parsing HTML from Real-Time URLs
Use the `html()` method to fetch and parse HTML content from a URL or a set of URLs in real time.

In [8]:
html_df = sparknlp.read().html("https://example.com/")
html_df.select("html").show()

+--------------------+
|                html|
+--------------------+
|[{Title, 0, Examp...|
+--------------------+



In [9]:
htmls_df = sparknlp.read().html(["https://www.wikipedia.org", "https://example.com/"])
htmls_df.show()

+--------------------+--------------------+
|                 url|                html|
+--------------------+--------------------+
|https://www.wikip...|[{Title, 0, Wikip...|
|https://example.com/|[{Title, 0, Examp...|
+--------------------+--------------------+
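When the URL list is assembled from an external source, one option is to drop malformed entries and duplicates before passing it to `html()`. This is a sketch under that assumption, using only the standard library; `clean_url_list` is a hypothetical helper, not a Spark NLP function.

```python
from urllib.parse import urlparse


def clean_url_list(candidates):
    """Keep well-formed http(s) URLs, drop duplicates, preserve input order (hypothetical helper)."""
    seen = set()
    cleaned = []
    for raw in candidates:
        url = raw.strip()
        parsed = urlparse(url)
        # Require an http/https scheme and a host part.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


urls = clean_url_list([
    "https://www.wikipedia.org",
    "https://example.com/",
    "https://example.com/",  # duplicate, dropped
    "not a url",             # no scheme or host, dropped
])
print(urls)
```

The resulting list can then be passed on in the same way as in the cell above, for example `sparknlp.read().html(urls)`.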



### Configuration Parameters

The `titleFontSize` parameter sets the font size used to identify paragraphs that should be treated as titles. The default is 16; if your HTML files use a different size for headings, adjust the parameter accordingly. The Spark NLP cell below sets it to 12 through the `params` dictionary passed to `sparknlp.read()`.
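As a mental model for what a title font-size threshold does, here is a minimal standard-library sketch. It is not Spark NLP's implementation: the inline `style` parsing, the `>=` comparison, the `Text` label, and the names `TitleSketch` and `classify` are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Matches the numeric part of an inline "font-size: 12pt" style declaration.
FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+)")


class TitleSketch(HTMLParser):
    """Label each <p> as 'Title' or 'Text' from its inline font size (illustrative only)."""

    def __init__(self, title_font_size=16):
        super().__init__()
        self.title_font_size = title_font_size
        self.elements = []   # collected (label, text) pairs
        self._label = None   # label of the <p> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            match = FONT_SIZE.search(dict(attrs).get("style") or "")
            size = int(match.group(1)) if match else 0
            # Assumption: sizes at or above the threshold count as titles.
            self._label = "Title" if size >= self.title_font_size else "Text"
            self._text = []

    def handle_data(self, data):
        if self._label is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._label is not None:
            self.elements.append((self._label, "".join(self._text).strip()))
            self._label = None


def classify(html, title_font_size=16):
    parser = TitleSketch(title_font_size)
    parser.feed(html)
    parser.close()
    return parser.elements


sample = (
    '<p style="font-size: 12pt">Section heading</p>'
    '<p style="font-size: 10pt">Body text.</p>'
)
print(classify(sample))                      # threshold 16: no paragraph qualifies as a title
print(classify(sample, title_font_size=12))  # threshold 12: the first paragraph becomes a title
```

Lowering the threshold lets smaller headings be picked up as titles, which is the effect the `titleFontSize` setting is described as having.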

In [12]:
params = {"titleFontSize": "12"}
html_df = sparknlp.read(params).html("./html-files/fake-html.html")
html_df.select("html").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|html                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------