![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_Extractor_Demo.ipynb)

# Introducing Extractor in SparkNLP
This notebook showcases the newly added `Extractor()` annotator in Spark NLP, which extracts key information (e.g., dates, email addresses, IP addresses) from sources such as `.eml` files. It simplifies data-parsing workflows by isolating the relevant details automatically.

In [0]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()
print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.4.1


## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading HTML files was introduced in Spark NLP 6.0.0. Please make sure you have upgraded to the latest Spark NLP release.
We simply need to import the `cleaners` components to use the `Extractor` annotator:

In [0]:
from sparknlp.annotator.cleaners import *

## Extracting data

Extracting information from eml data

In [0]:
eml_data = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
  \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
  n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""

data_set = spark.createDataFrame([[eml_data]]).toDF("text")

Extracting date

In [0]:
from pyspark.ml import Pipeline  # explicit import so Pipeline is always in scope
from sparknlp.annotator import *
from sparknlp.base import *

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("date") \
    .setExtractorMode("email_date")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("date").show(truncate=False)

+------------------------------------------------------------+
|date                                                        |
+------------------------------------------------------------+
|[{chunk, 136, 166, Fri, 26 Mar 2021 11:04:09 +1200, {}, []}]|
+------------------------------------------------------------+
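
Each row in the output above is a list of annotation structs. Spark NLP annotations carry six fields: `annotatorType`, `begin`, `end`, `result`, `metadata`, and `embeddings`. Judging by the offsets printed above, `end` appears to be inclusive. The plain-Python sketch below (no Spark required) mirrors one row with a hypothetical `ChunkRow` tuple, an illustrative name that is not part of Spark NLP, and checks that slicing the input with those offsets recovers the extracted date.

```python
from typing import NamedTuple


class ChunkRow(NamedTuple):
    """Hypothetical stand-in for one annotation struct (not a Spark NLP class)."""
    annotator_type: str
    begin: int
    end: int  # treated as inclusive, as the printed offsets suggest
    result: str
    metadata: dict
    embeddings: list


# Same value as eml_data in the cell above, written out explicitly:
# the trailing backslash in the triple-quoted string joins its last two lines.
text = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by\n"
    "  \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id"
    "  n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"
)

# Values copied from the printed row above.
row = ChunkRow("chunk", 136, 166, "Fri, 26 Mar 2021 11:04:09 +1200", {}, [])

# Slice with an inclusive end offset to recover the extracted text.
recovered = text[row.begin : row.end + 1]
print(recovered == row.result)
```

The same slicing rule applies to the other outputs in this notebook, which makes the offsets handy for highlighting matches in the original text.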



Extracting email addresses

In [0]:
eml_data = [
    "Me me@email.com and You <You@email.com>\n  ([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)",
    "Im Rabn <Im.Rabn@npf.gov.nr>"
]

data_set = spark.createDataFrame(eml_data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("email") \
    .setExtractorMode("email_address")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("email").show(truncate=False)

+------------------------------------------------------------------------------+
|email                                                                         |
+------------------------------------------------------------------------------+
|[{chunk, 3, 14, me@email.com, {}, []}, {chunk, 25, 37, You@email.com, {}, []}]|
|[{chunk, 9, 26, Im.Rabn@npf.gov.nr, {}, []}]                                  |
+------------------------------------------------------------------------------+



Extracting IPv4 and IPv6 addresses

In [0]:
eml_data = [
    """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    ABC.DEF.local ([68.183.71.12]) with mapi id
    32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
]

data_set = spark.createDataFrame(eml_data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("ip_address") \
    .setExtractorMode("ip_address")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("ip_address").show(truncate=False)

+-------------------------------------------------------------------------------------------+
|ip_address                                                                                 |
+-------------------------------------------------------------------------------------------+
|[{chunk, 21, 45, ba23::58b5:2236:45g2:88h2, {}, []}, {chunk, 72, 83, 68.183.71.12, {}, []}]|
+-------------------------------------------------------------------------------------------+



Extracting MAPI IDs

In [0]:
eml_data = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
  \n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
  n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""

data_set = spark.createDataFrame([[eml_data]]).toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("mapi_id") \
    .setExtractorMode("mapi_id")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("mapi_id").show(truncate=False)

+-------------------------------------------+
|mapi_id                                    |
+-------------------------------------------+
|[{chunk, 120, 133, 32.88.5467.123, {}, []}]|
+-------------------------------------------+



Extracting US phone numbers

In [0]:
data = [
    "215-867-5309",
    "Phone Number: +1 215.867.5309",
    "Phone Number: Just Kidding"
]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("us_phones") \
    .setExtractorMode("us_phone_numbers")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("us_phones").show(truncate=False)

+------------------------------------------+
|us_phones                                 |
+------------------------------------------+
|[{chunk, 0, 11, 215-867-5309, {}, []}]    |
|[{chunk, 14, 28, +1 215.867.5309, {}, []}]|
|[]                                        |
+------------------------------------------+



Extracting bullets from text

In [0]:
data = [
    "1. Introduction:",
    "a. Introduction:",
    "5.3.1 Convolutional Networks",
    "D.b.C Recurrent Neural Networks",
    "2.b.1 Recurrent Neural Networks",
    "bb.c Feed Forward Neural Networks",
    "Fig. 2: The relationship"
]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("bullets") \
    .setExtractorMode("bullets")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("bullets").show(truncate=False)

+------------------------------------------------------------------------------------+
|bullets                                                                             |
+------------------------------------------------------------------------------------+
|[{chunk, 0, 2, (1,None,None), {section -> 1}, []}]                                  |
|[{chunk, 0, 2, (a,None,None), {section -> a}, []}]                                  |
|[{chunk, 0, 5, (5,3,1), {section -> 5, sub_section -> 3, sub_sub_section -> 1}, []}]|
|[{chunk, 0, 5, (D,b,C), {section -> D, sub_section -> b, sub_sub_section -> C}, []}]|
|[{chunk, 0, 5, (2,b,1), {section -> 2, sub_section -> b, sub_sub_section -> 1}, []}]|
|[{chunk, 0, 4, (bb,c,None), {section -> bb, sub_section -> c}, []}]                 |
|[{chunk, 0, 0, (None,None,None), {}, []}]                                           |
+------------------------------------------------------------------------------------+
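
The rows above suggest a simple rule: only the first whitespace-delimited token is inspected, it is split on dots, and a leading part longer than two characters (as in `Fig.`) is not treated as a bullet. The plain-Python sketch below reproduces the seven rows under that assumption. The function name `split_bullet` is hypothetical, and this is not the `Extractor` implementation, only one rule that happens to agree with the printed output.

```python
def split_bullet(text):
    """Return (section, sub_section, sub_sub_section) for an ordered bullet.

    Assumed rule, inferred from the sample output: inspect only the first
    token, require a dot, and reject leading parts longer than two characters.
    """
    tokens = text.split()
    first = tokens[0] if tokens else ""
    if "." not in first or ".." in first:
        return (None, None, None)

    parts = first.split(".")
    if parts[-1] == "":  # drop the empty piece after a trailing dot ("1.")
        parts.pop()
    if len(parts[0]) > 2:  # e.g. "Fig." is not a bullet
        return (None, None, None)

    # Keep at most three levels and pad the missing ones with None.
    parts = parts[:3] + [None] * (3 - len(parts))
    return tuple(parts)


for line in ["1. Introduction:", "bb.c Feed Forward Neural Networks", "Fig. 2: The relationship"]:
    print(line, "->", split_bullet(line))
```

In the Spark output, the same three levels also appear in the annotation metadata under the keys `section`, `sub_section`, and `sub_sub_section`.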



Extracting image URLs

In [0]:
data = [
    "https://my-image.png with some text",
    "some text https://my-image.jpg with another http://my-image.bmp",
    "http://my-path/my%20image.JPG",
    """<img src="https://example.com/images/photo1.jpg" />
    <img src="https://example.org/assets/icon.png" />
    <link href="https://example.net/style.css" />"""
]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("image_urls") \
    .setExtractorMode("image_urls")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("image_urls").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------+
|image_urls                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 0, 19, https://my-image.png, {}, []}]                                                                                 |
|[{chunk, 10, 29, https://my-image.jpg, {}, []}, {chunk, 44, 62, http://my-image.bmp, {}, []}]                                  |
|[{chunk, 0, 28, http://my-path/my%20image.JPG, {}, []}]                                                                        |
|[{chunk, 10, 46, https://example.com/images/photo1.jpg, {}, []}, {chunk, 66, 100, https://example.org/assets/icon.png, {}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------+

Extracting the text after a pattern

In [0]:
data = ["SPEAKER 1: Look at me, I'm flying!"]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("text_after") \
    .setExtractorMode("text_after") \
    .setTextPattern("SPEAKER \\d{1}:")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("text_after").show(truncate=False)

+------------------------------------------------------------+
|text_after                                                  |
+------------------------------------------------------------+
|[{chunk, 10, 34, Look at me, I'm flying!, {index -> 0}, []}]|
+------------------------------------------------------------+



Extracting the text before a pattern

In [0]:
data = ["Here I am! STOP Look at me! STOP I'm flying! STOP"]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("text_before") \
    .setExtractorMode("text_before") \
    .setTextPattern("STOP")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("text_before").show(truncate=False)

+----------------------------------------------+
|text_before                                   |
+----------------------------------------------+
|[{chunk, 0, 11, Here I am!, {index -> 0}, []}]|
+----------------------------------------------+



## Custom Patterns

As the examples above show, the `Extractor` ships with default patterns for the most common kinds of data. However, you can also set custom regex patterns to address your specific extraction needs.

In [0]:
eml_data = [
    """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
    ABC.DEF.local ([68.183.71.12]) with mapi id
    32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
]

data_set = spark.createDataFrame(eml_data, "string").toDF("text")

In [0]:
my_ipv4_regex = "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)){3}"
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("ipv4_address") \
    .setExtractorMode("ip_address") \
    .setIpAddressPattern(my_ipv4_regex)

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(data_set)
result = model.transform(data_set)
result.select("ipv4_address").show(truncate=False)

+---------------------------------------+
|ipv4_address                           |
+---------------------------------------+
|[{chunk, 72, 83, 68.183.71.12, {}, []}]|
+---------------------------------------+
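
A custom pattern can be sanity-checked with Python's built-in `re` module before it is passed to `setIpAddressPattern`. The sketch below assumes the pattern behaves the same under the Python and Java regex engines, which should hold for this simple expression but is not guaranteed for every regex. The helper name `find_spans` is illustrative and not part of Spark NLP.

```python
import re

# Same pattern as in the cell above, written as a raw string.
my_ipv4_regex = (
    r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)"
    r"(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}"
)

# Same value as the eml_data entry above, with the line indentation spelled out.
text = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by\n"
    "    ABC.DEF.local ([68.183.71.12]) with mapi id\n"
    "    32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"
)


def find_spans(pattern, text):
    """Return (begin, inclusive_end, match) triples for every match."""
    return [(m.start(), m.end() - 1, m.group()) for m in re.finditer(pattern, text)]


spans = find_spans(my_ipv4_regex, text)
print(spans)
```

With the sample text this prints a single span, `(72, 83, '68.183.71.12')`, which agrees with the Spark output above: the IPv6 address is skipped, and `32.88.5467.123` is rejected because `5467` is not a valid octet.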



Using `index` with `text_after` and `text_before`

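The `index` parameter tells the `Extractor` which occurrence of the text pattern (set with `setTextPattern`) to use as the reference point for extracting text. Occurrences are counted from 0, as the `index -> 0` metadata in the earlier outputs indicates. For example: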

In [0]:
data = ["Teacher: BLAH BLAH BLAH; Student: BLAH BLAH BLAH!"]

test_df = spark.createDataFrame(data, "string").toDF("text")

In [0]:
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("text_before") \
    .setExtractorMode("text_before") \
    .setTextPattern("BLAH") \
    .setIndex(1)

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("text_before").show(truncate=False)

+-------------------------------------------------+
|text_before                                      |
+-------------------------------------------------+
|[{chunk, 0, 14, Teacher: BLAH, {index -> 1}, []}]|
+-------------------------------------------------+



In [0]:
# Same extractor without setIndex: the default index (0) is used
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("text_before") \
    .setExtractorMode("text_before") \
    .setTextPattern("BLAH")

pipeline = Pipeline().setStages([
    document_assembler,
    extractor
])

model = pipeline.fit(test_df)
result = model.transform(test_df)
result.select("text_before").show(truncate=False)

+-------------------------------------------+
|text_before                                |
+-------------------------------------------+
|[{chunk, 0, 9, Teacher:, {index -> 0}, []}]|
+-------------------------------------------+
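
The two runs above can be mimicked in plain Python to make the role of `index` concrete. This is a sketch assuming the behaviour inferred from the outputs in this notebook (zero-based occurrence counting, surrounding whitespace trimmed from the result). The function names `text_before` and `text_after` are illustrative and do not call Spark NLP, and returning `None` when the pattern does not occur is an arbitrary choice made for this sketch.

```python
import re


def _nth_match(pattern, text, index):
    """Return the index-th (zero-based) match of pattern in text, or None."""
    for i, match in enumerate(re.finditer(pattern, text)):
        if i == index:
            return match
    return None


def text_before(text, pattern, index=0):
    """Text preceding the index-th occurrence of pattern, whitespace-trimmed."""
    match = _nth_match(pattern, text, index)
    return text[: match.start()].strip() if match else None


def text_after(text, pattern, index=0):
    """Text following the index-th occurrence of pattern, whitespace-trimmed."""
    match = _nth_match(pattern, text, index)
    return text[match.end():].strip() if match else None


sample = "Teacher: BLAH BLAH BLAH; Student: BLAH BLAH BLAH!"
print(text_before(sample, "BLAH", index=1))  # reference point: second BLAH
print(text_before(sample, "BLAH"))           # reference point: first BLAH
print(text_after("SPEAKER 1: Look at me, I'm flying!", r"SPEAKER \d{1}:"))
```

The three calls print `Teacher: BLAH`, `Teacher:`, and `Look at me, I'm flying!`, the same strings that appear in the `result` field of the corresponding Spark outputs.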

