# Introducing the Email Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().email()` method in Spark NLP, which parses email content from both local and distributed file systems into a Spark DataFrame.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
!cp drive/MyDrive/JSL/sparknlp/sparknlp.jar .
!cp drive/MyDrive/JSL/sparknlp/spark_nlp-5.5.1-py2.py3-none-any.whl .

In [3]:
!pip install pyspark



In [9]:
!pip install spark_nlp-5.5.1-py2.py3-none-any.whl

Processing ./spark_nlp-5.5.1-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-5.5.1


In [5]:
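# sparknlp.start() would normally create the session:
#   import sparknlp
#   spark = sparknlp.start()
# Here the session is built manually so that it loads the local sparknlp.jar copied above.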

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars", "./sparknlp.jar") \
    .getOrCreate()


print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.3


## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading email files was introduced in Spark NLP 5.5.2. Please make sure you have upgraded to the latest Spark NLP release.

For the local-files example, we will download a couple of email files from the Spark NLP GitHub repo:

In [17]:
!mkdir email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml -P email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml -P email-files

mkdir: cannot create directory ‘email-files’: File exists
--2024-11-13 21:01:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml’


2024-11-13 21:01:15 (29.9 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]

--2024-11-13 21:01:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199

In [18]:
!ls -lh ./email-files

total 1.3M
-rw-r--r-- 1 root root 3.2K Nov 13 21:01 email-text-attachments.eml
-rw-r--r-- 1 root root 1.3M Nov 13 21:01 test-several-attachments.eml
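
Before handing the files to Spark NLP, it can help to see what an `.eml` file with attachments contains. The following is a minimal sketch that uses only Python's standard `email` package and is independent of Spark NLP. The helper name `summarize_parts` and the message it parses are hypothetical: the message is built in memory so the cell runs without the downloaded files, and it is not a copy of either test file.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_parts(raw_bytes):
    """Return (subject, [(content_type, filename), ...]) for a raw .eml payload."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    parts = [
        (part.get_content_type(), part.get_filename())
        for part in msg.walk()
        if not part.is_multipart()  # skip container parts, keep the leaves
    ]
    return str(msg["Subject"]), parts


# Hypothetical message, built only so the example is self-contained.
example = EmailMessage()
example["Subject"] = "Hypothetical subject"
example["From"] = "sender@example.com"
example["To"] = "recipient@example.com"
example.set_content("Plain-text body.")
example.add_attachment(
    b"attachment bytes",
    maintype="application",
    subtype="octet-stream",
    filename="notes.bin",
)

subject, parts = summarize_parts(example.as_bytes())
print(subject)
for content_type, filename in parts:
    print(content_type, filename)
```

To inspect one of the downloaded files instead, the same function could be given the bytes read from `email-files/`, for example via `open(path, "rb").read()`.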


## Parsing Email from Local Files
Use the `email()` method to parse email content from local directories.

In [22]:
import sparknlp
email_df = sparknlp.read().email("./email-files")

email_df.select("email").show()

+--------------------+
|               email|
+--------------------+
|[{Title, Email Te...|
|[{Title, Test Sev...|
+--------------------+



In [21]:
email_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- content: binary (nullable = true)
 |-- email: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
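
According to the schema above, each row carries an `email` array whose elements have an `elementType`, a `content` string, and a `metadata` map. The sketch below shows one way to group element contents by type once a row has been collected to the driver. It is plain Python run against a hand-written stand-in row: only the field names and the `Title` type come from the output above, while `ExampleBody`, the path, and all values are hypothetical placeholders rather than real reader output.

```python
from collections import defaultdict

# Hand-written stand-in for one collected row. Only the field names mirror the
# schema; every value (and the "ExampleBody" type) is hypothetical.
sample_row = {
    "path": "file:/tmp/example.eml",
    "email": [
        {"elementType": "Title", "content": "Example subject", "metadata": {}},
        {"elementType": "ExampleBody", "content": "First paragraph.", "metadata": {}},
        {"elementType": "ExampleBody", "content": "Second paragraph.", "metadata": {}},
    ],
}


def group_by_element_type(row):
    """Map each elementType to its content strings, preserving element order."""
    grouped = defaultdict(list)
    for element in row["email"] or []:  # the column is nullable
        grouped[element["elementType"]].append(element["content"])
    return dict(grouped)


grouped = group_by_element_type(sample_row)
print(grouped["Title"])
```

With a real DataFrame, calling `asDict(recursive=True)` on a collected `Row` should produce a dictionary of this shape, so the same helper could be applied to `email_df.collect()` results on small inputs.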



You can also read from a distributed file system, such as Databricks (`dbfs://`) or HDFS (`hdfs://`) directories.