![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_Email_Data_Preparation.ipynb)

# Data Preparation with Spark NLP
This notebook demonstrates how to leverage the new `read()` component in Spark NLP alongside the `Cleaner` or `Extractor` annotators to efficiently preprocess your data before feeding it into an NLP model.

Incorporating this preprocessing step into your pipeline is highly recommended, as it can significantly enhance the quality and performance of your NLP model.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading email files was introduced in Spark NLP 5.5.2, while the `Cleaner` and `Extractor` annotators were introduced in Spark NLP 6.0.0.
Please make sure you have upgraded to the latest Spark NLP release.
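
A quick way to confirm both requirements is to compare the installed version against the two minimums above. The sketch below uses only the standard library. The helper names `parse_version` and `missing_features` are hypothetical, and it assumes the version string consists of plain dotted integers such as `6.0.0`.

```python
# Minimum Spark NLP versions stated above for each feature.
MIN_VERSIONS = {
    "email reader": (5, 5, 2),
    "Cleaner/Extractor": (6, 0, 0),
}


def parse_version(version: str) -> tuple:
    """Turn '6.0.0' into (6, 0, 0); assumes plain dotted integers."""
    return tuple(int(part) for part in version.split(".")[:3])


def missing_features(installed: str) -> list:
    """Return the features that need a newer release than `installed`."""
    current = parse_version(installed)
    return [name for name, minimum in MIN_VERSIONS.items() if current < minimum]


# In the notebook you would pass sparknlp.version() instead of a literal.
print(missing_features("5.5.2"))  # ['Cleaner/Extractor']
```

An empty list means the installed release covers everything used in this notebook.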

- Let's install and set up Spark NLP in Google Colab
- This part is pretty easy via our simple script

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

### Additional Configuration for Databricks

When running on Databricks, it is necessary to include the following Spark configurations to avoid dependency conflicts:

- `spark.driver.userClassPathFirst true`
- `spark.executor.userClassPathFirst true`

These configurations are required because the Databricks runtime environment includes a bundled version of the `com.sun.mail:jakarta.mail` library, which conflicts with `jakarta.activation`. By setting these properties, the application ensures that the user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.

For the local files example, we will download a couple of email files from the Spark NLP GitHub repository:

In [None]:
!mkdir email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml -P email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml -P email-files

--2025-02-12 20:07:48--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml’


2025-02-12 20:07:48 (43.7 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]

--2025-02-12 20:07:48--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubuse

In [None]:
!ls -lh ./email-files

total 1.3M
-rw-r--r-- 1 root root 3.2K Feb 12 20:07 email-text-attachments.eml
-rw-r--r-- 1 root root 1.3M Feb 12 20:07 test-several-attachments.eml


## Parsing Email from Local Files
Use the `email()` method to parse email content from local directories.

In [None]:
import sparknlp
email_df = sparknlp.read().email("./email-files")

email_df.select("email").show(truncate=False)


Let's check the schema of this DataFrame:

In [None]:
email_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- email: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
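
To make the shape of the `email` column more concrete, the sketch below uses Python's standard `email` package to turn a small in-memory message into a list of dictionaries with the same three fields as the schema above (`elementType`, `content`, `metadata`). This only illustrates the data layout and is not how Spark NLP parses emails. The helper names `build_sample_eml` and `to_elements`, the sample addresses, and the choice to label every text part as `NarrativeText` are assumptions.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_eml() -> bytes:
    """Build a small multipart email in memory (a stand-in for an .eml file)."""
    msg = EmailMessage()
    msg["From"] = "Sender <sender@example.com>"
    msg["To"] = "Receiver <receiver@example.com>"
    msg["Subject"] = "Sample"
    msg.set_content("Plain body")  # becomes the text/plain part
    msg.add_alternative("<p>HTML body</p>", subtype="html")  # adds a text/html part
    return msg.as_bytes()


def to_elements(raw: bytes) -> list:
    """Flatten the text parts of an email into element dictionaries."""
    msg = message_from_bytes(raw, policy=policy.default)
    shared = {"sent_from": str(msg["From"]), "sent_to": str(msg["To"])}
    elements = []
    for part in msg.walk():
        # Skip multipart containers; keep only text/* leaf parts.
        if part.get_content_maintype() != "text":
            continue
        elements.append({
            "elementType": "NarrativeText",  # assumed label for every text part
            "content": part.get_content(),
            "metadata": {**shared, "mimeType": part.get_content_type()},
        })
    return elements


for element in to_elements(build_sample_eml()):
    print(element["elementType"], element["metadata"]["mimeType"])
```

Each dictionary corresponds to one struct in the `email` array, and the `metadata` map is what allows filtering by MIME type later on.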



As seen in the schema and output, we have the email information along with metadata that can be used to filter and sanitize the data. Let's take a closer look at the metadata for this email data:

In [None]:
from pyspark.sql.functions import col, explode

email_metadata_df = email_df.withColumn("email_metadata", explode(col("email.metadata")))
email_metadata_df.select("email_metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|email_exploded                                                                                                                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <danilo@johnsnowlabs.com>}                                                                                                                      |
|{sent_to -> Danilo Burbano <danilo@johnsnowlabs.com>, sent_from -> Danilo Burbano <

In this example, we are not interested in results containing HTML data, so we will focus only on plain text.

In [None]:
from pyspark.sql.functions import col, explode

# Keep only NarrativeText elements with text/plain content from the email array
narrative_email_df = email_df.selectExpr(
    "path",
    "FILTER(email, x -> x.elementType = 'NarrativeText' AND x.metadata['mimeType'] = 'text/plain') AS narrative_email"
)

exploded_df = narrative_email_df.withColumn("email_exploded", explode(col("narrative_email")))

# Select only the content field from the exploded struct
email_content_df = exploded_df.select(
    "path",
    col("email_exploded.content").alias("narrative_text")
)

email_content_df.show(truncate=False)

+------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|path                                                  |narrative_text                                                                                                                                      |
+------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|file:/content/email-files/email-text-attachments.eml  |Email  test with two text attachments\r\n\r\nCheers,\r\n\r\n                                                                                        |
|file:/content/email-files/test-several-attachments.eml|This is only a test email with attachments to verify EmailReader feature in Spark NLP.\r\n\r\nYou don't need to reply to
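
The `FILTER(email, x -> ...)` expression used above is Spark SQL's higher-order function for arrays: it keeps the elements for which the lambda returns true. As a rough mental model, the same predicate on plain Python dictionaries looks like the sketch below. The sample elements are made up for illustration (including the `Attachment` type and the `application/pdf` value); only the field names and the filter condition come from the notebook.

```python
# Hypothetical elements shaped like the structs in the `email` array.
sample_elements = [
    {"elementType": "NarrativeText", "content": "Hello in plain text",
     "metadata": {"mimeType": "text/plain"}},
    {"elementType": "NarrativeText", "content": "<p>Hello in HTML</p>",
     "metadata": {"mimeType": "text/html"}},
    {"elementType": "Attachment", "content": "report.pdf",
     "metadata": {"mimeType": "application/pdf"}},
    {"elementType": "NarrativeText", "content": "No MIME type recorded",
     "metadata": {}},
]


def keep_plain_narrative(elements: list) -> list:
    """Python counterpart of the FILTER predicate used in the cell above."""
    return [
        x for x in elements
        if x["elementType"] == "NarrativeText"
        # .get() returns None for a missing key, so such elements are dropped.
        and x["metadata"].get("mimeType") == "text/plain"
    ]


for element in keep_plain_narrative(sample_elements):
    print(element["content"])  # Hello in plain text
```

Running the filter inside `selectExpr` keeps it in Spark's execution engine, so no Python function has to be applied to each row.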

Now we can use the `Cleaner` annotator to remove any remaining undesired characters from the data.

In [None]:
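from pyspark.ml import Pipeline  # imported explicitly so the cell does not rely on star imports
from sparknlp.base import *
from sparknlp.annotator.cleaners import *

# Wrap the raw narrative text in Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("narrative_text") \
    .setOutputCol("document")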

cleaner = Cleaner() \
    .setInputCols(["document"]) \
    .setOutputCol("cleaned") \
    .setCleanerMode("clean") \
    .setBullets(True) \
    .setExtraWhitespace(True) \
    .setDashes(True)

pipeline = Pipeline().setStages([
    document_assembler,
    cleaner
])

model = pipeline.fit(email_content_df)
clean_email_content_df = model.transform(email_content_df)
clean_email_content_df.select("cleaned").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleaned                                                                                                                                                     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 0, 44, Email test with two text attachments Cheers,, {}, []}]                                                                                      |
|[{chunk, 0, 129, This is only a test email with attachments to verify EmailReader feature in Spark NLP. You don't need to reply to this message 🙂, {}, []}]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
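
For intuition about what the three options enabled above do, here is a rough standard-library approximation. This is not Spark NLP's implementation: the regular expressions and the helper name `rough_clean` are assumptions chosen for illustration, and the real annotator may treat edge cases differently.

```python
import re

# Assumed patterns; the real Cleaner may recognize a different set of characters.
BULLETS = re.compile(r"^\s*[\u2022\u25CF\u25E6\u00B7*]\s+")  # one leading bullet marker
DASHES = re.compile(r"[-\u2013\u2014]+")                     # hyphen, en dash, em dash
WHITESPACE = re.compile(r"\s+")                              # spaces, tabs, \r, \n


def rough_clean(text: str) -> str:
    """Approximate bullet, dash, and extra-whitespace cleaning."""
    text = BULLETS.sub("", text)       # drop a bullet at the start of the text
    text = DASHES.sub(" ", text)       # replace dashes with a space
    text = WHITESPACE.sub(" ", text)   # collapse whitespace runs to one space
    return text.strip()


raw = "Email  test with two text attachments\r\n\r\nCheers,\r\n\r\n"
print(rough_clean(raw))  # Email test with two text attachments Cheers,
```

On the first sample email this reproduces the cleaned text shown in the output above, which makes it a convenient sanity check when experimenting with the cleaner options.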



Your cleaned text is now ready to feed into an NLP model.