![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_Email_Reader_Demo.ipynb)

# Introducing the Email Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().email()` method in Spark NLP, which parses email content from both local and distributed file systems into a Spark DataFrame.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading email files was introduced in Spark NLP 5.5.2. Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local-files example, we will download a couple of email files from the Spark NLP GitHub repo:

In [7]:
!mkdir email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml -P email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/test-several-attachments.eml -P email-files

--2025-03-06 00:20:35--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml’


2025-03-06 00:20:35 (34.6 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]

--2025-03-06 00:20:35--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/test-several-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, await

In [8]:
!ls -lh ./email-files

total 1.3M
-rw-r--r-- 1 root root 3.2K Mar  6 00:20 email-text-attachments.eml
-rw-r--r-- 1 root root 1.3M Mar  6 00:20 test-several-attachments.eml
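
If you would rather try the reader on a file of your own, a small `.eml` file can be produced with Python's standard `email` package. The sketch below is only an illustration: the function name `write_sample_email`, the addresses, the subject, and the attachment name `note.txt` are all made up for this example and are not part of the Spark NLP test resources.

```python
from email.message import EmailMessage
from pathlib import Path


def write_sample_email(directory: str = "email-files") -> Path:
    """Write a tiny multipart email with one text attachment and return its path.

    Every value below is a placeholder chosen for this sketch.
    """
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Sample email with a text attachment"
    msg.set_content("This is the body of the sample email.")
    # Adding an attachment converts the message to multipart/mixed.
    msg.add_attachment("This text lives in the attachment.", filename="note.txt")

    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "sample-email.eml"
    target.write_bytes(msg.as_bytes())
    return target
```

Calling `write_sample_email()` with no arguments would place `sample-email.eml` next to the downloaded files in `email-files`.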


## Parsing Email from Local Files
Use the `email()` method to parse email content from local directories.

In [9]:
import sparknlp
email_df = sparknlp.read().email("./email-files")

email_df.select("email").show()

+--------------------+
|               email|
+--------------------+
|[{Title, Test Sev...|
|[{Title, Email Te...|
+--------------------+



In [10]:
email_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- email: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
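
The schema above reads as follows: each row holds a `path` plus an `email` array, and every array element carries an `elementType`, its `content`, and a string-to-string `metadata` map. As a rough mental model only, one row can be pictured in plain Python. The values and the metadata key below are invented placeholders, not real reader output, and `contents_of_type` is a hypothetical helper.

```python
from collections import Counter

# One row pictured as plain Python data. All values are placeholders;
# only the field names come from the printed schema.
row = {
    "path": "file:/path/to/example.eml",
    "email": [
        {"elementType": "Title", "content": "Example subject",
         "metadata": {"example_key": "example value"}},
        {"elementType": "NarrativeText", "content": "First paragraph of the body.",
         "metadata": {"example_key": "example value"}},
        {"elementType": "NarrativeText", "content": "Second paragraph of the body.",
         "metadata": {"example_key": "example value"}},
    ],
}


def contents_of_type(elements, element_type):
    """Return the content of every element whose elementType matches."""
    return [e["content"] for e in elements if e["elementType"] == element_type]


narrative = contents_of_type(row["email"], "NarrativeText")
type_counts = Counter(e["elementType"] for e in row["email"])
```

This is the same select-then-filter idea that `explode` and `filter` express on the real DataFrame.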



You can also read from distributed file systems, such as Databricks `dbfs://` paths or HDFS `hdfs://` directories.

### Configuration Parameters

Let's add an email file for this example.

In [11]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml -P email-files

--2025-03-06 00:20:57--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml.1’


2025-03-06 00:20:57 (27.2 MB/s) - ‘email-files/email-text-attachments.eml.1’ saved [3175/3175]



- `addAttachmentContent`: By default, this is set to `false`. When enabled, the output will include the content of attachments.

In [12]:
params = {"addAttachmentContent": "true"}
email_df = sparknlp.read(params).email("./email-files/email-text-attachments.eml")
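
Note that the option value above is the string `"true"`, not a Python boolean. Assuming the reader expects string values for all of its options, a small hypothetical helper such as `reader_params` (not part of Spark NLP) can build the dictionary from ordinary booleans:

```python
from typing import Dict


def reader_params(**options: bool) -> Dict[str, str]:
    """Convert boolean keyword arguments into the lowercase strings used above."""
    return {name: str(value).lower() for name, value in options.items()}


params = reader_params(addAttachmentContent=True)
# params is now {"addAttachmentContent": "true"}
```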



In [13]:
from pyspark.sql.functions import explode, col

narrative_text_df = (
    email_df
    .select(
        explode(col("email")).alias("email_element")
    )
    .filter(col("email_element.elementType") == "NarrativeText")
    .select(
        col("email_element.elementType"),
        col("email_element.content")
    )
)

narrative_text_df.show(truncate=False)

+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see in the DataFrame above, the `NarrativeText` elements include the data from the attached text files. For comparison, the next cells read the same file with the default settings.

In [14]:
import sparknlp
email_df = sparknlp.read().email("./email-files/email-text-attachments.eml")



In [15]:
from pyspark.sql.functions import explode, col

narrative_text_df = (
    email_df
    .select(
        explode(col("email")).alias("email_element")
    )
    .filter(col("email_element.elementType") == "NarrativeText")
    .select(
        col("email_element.elementType"),
        col("email_element.content")
    )
)

narrative_text_df.show(truncate=False)

+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see in the DataFrame above, with the default settings the `NarrativeText` elements do not include the data from the attached text files.

- `storeContent`: By default, this is set to `false`. When enabled, the output includes an additional `content` column holding the byte content of the file.

In [16]:
params = {"storeContent": "true"}
email_df = sparknlp.read(params).email("./email-files/email-text-attachments.eml")
email_df.show()

+--------------------+--------------------+--------------------+
|                path|               email|             content|
+--------------------+--------------------+--------------------+
|file:/content/ema...|[{Title, Email Te...|[46 72 6F 6D 3A 2...|
+--------------------+--------------------+--------------------+
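
The `content` column is displayed as hexadecimal byte values. As a quick sanity check, and assuming the preview really is a run of space-separated hex pairs, the five complete bytes visible above can be decoded with the standard library. The helper name `decode_hex_preview` is made up for this sketch.

```python
def decode_hex_preview(preview: str) -> str:
    """Decode space-separated hex byte pairs into text."""
    # bytes.fromhex skips the whitespace between the pairs.
    return bytes.fromhex(preview).decode("utf-8", errors="replace")


decoded = decode_hex_preview("46 72 6F 6D 3A")
print(decoded)  # prints "From:"
```

The result matches the start of a raw email header, which suggests the column holds the original file bytes rather than parsed text.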

